Data processing system, method and interconnect fabric for selective link information allocation in a data processing system

ABSTRACT

A data processing system includes a plurality of processing units coupled for communication by a communication link and a configuration register. The configuration register has a plurality of different settings each corresponding to a respective one of a plurality of different link information allocations. Information is communicated over the communication link in accordance with a particular link information allocation among the plurality of link information allocations determined by a respective setting of the configuration register.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 11/055,366, filed on Feb. 10, 2005, entitled “Data Processing System, Method and Interconnect Fabric for Selective Link Information Allocation in a Data Processing System”, which is also related to the following copending applications, which are assigned to the assignee of the present invention and incorporated herein by reference in their entireties:

(1) U.S. patent application Ser. No. 11/055,467;

(2) U.S. patent application Ser. No. 11/055,297; and

(3) U.S. patent application Ser. No. 11/055,299.

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates in general to data processing systems and, in particular, to an improved interconnect fabric for data processing systems.

2. Description of the Related Art

A conventional symmetric multiprocessor (SMP) computer system, such as a server computer system, includes multiple processing units all coupled to a system interconnect, which typically comprises one or more address, data and control buses. Coupled to the system interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and which generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.

As the clock frequencies at which processing units are capable of operating have risen and system scales have increased, the latency of communication between processing units via the system interconnect has become a critical performance concern. To address this performance concern, various interconnect designs have been proposed and/or implemented that are intended to improve performance and scalability over conventional bused interconnects.

SUMMARY OF THE INVENTION

The present invention provides an improved data processing system, interconnect fabric and method of communication in a data processing system.

In one embodiment, a data processing system includes a plurality of processing units each having a respective point-to-point communication link with each of multiple others of the plurality of processing units but fewer than all of the plurality of processing units. Each of the plurality of processing units includes interconnect logic, coupled to each point-to-point communication link of that processing unit, that broadcasts operations received from one of the multiple others of the plurality of processing units to one or more of the plurality of processing units.

In another embodiment, a data processing system includes a plurality of processing units each having a respective point-to-point communication link with each of multiple others of the plurality of processing units but fewer than all of the plurality of processing units. Each of the plurality of processing units includes interconnect logic, coupled to each point-to-point communication link of that processing unit, that broadcasts requests received from one of the multiple others of the plurality of processing units to one or more of the plurality of processing units. The interconnect logic includes a partial response data structure including a plurality of entries each associating a partial response field with a plurality of flags respectively associated with each processing unit containing a snooper from which that processing unit will receive a partial response. The interconnect logic accumulates partial responses of processing units by reference to the partial response field to obtain an accumulated partial response, and when the plurality of flags indicate that all processing units from which partial responses are expected have returned a partial response, outputs the accumulated partial response.

In yet another embodiment, a data processing system includes a plurality of processing units, including at least a local master and a local hub, which are coupled for communication via a communication link. The local master includes a master capable of initiating an operation, a snooper capable of receiving an operation, and interconnect logic coupled to a communication link coupling the local master to the local hub. The interconnect logic includes request logic that synchronizes internal transmission of a request of the master to the snooper with transmission, via the communication link, of the request to the local hub.

In yet another embodiment, a data processing system includes a plurality of processing units coupled for communication by a communication link and a configuration register. The configuration register has a plurality of different settings each corresponding to a respective one of a plurality of different link information allocations. Information is communicated over the communication link in accordance with a particular link information allocation among the plurality of link information allocations determined by a respective setting of the configuration register.

All objects, features, and advantages of the present invention will become apparent in the following detailed written description.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. However, the invention, as well as a preferred mode of use, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is a high level block diagram of a processing unit in accordance with the present invention;

FIG. 2 is a high level block diagram of an exemplary data processing system in accordance with the present invention;

FIG. 3 is a time-space diagram of an exemplary operation including a request phase, a partial response phase and a combined response phase;

FIG. 4 is a time-space diagram of an exemplary operation within the data processing system of FIG. 2;

FIGS. 5A-5C depict the information flow of the exemplary operation depicted in FIG. 4;

FIGS. 5D-5E depict an exemplary data flow for an exemplary operation in accordance with the present invention;

FIG. 6 is a time-space diagram of an exemplary operation, illustrating the timing constraints of an arbitrary data processing system topology;

FIGS. 7A-7B illustrate a first exemplary link information allocation for the first and second tier links in accordance with the present invention;

FIG. 7C is an exemplary embodiment of a partial response field for a write request that is included within the link information allocation;

FIGS. 8A-8B depict a second exemplary link information allocation for the first and second tier links in accordance with the present invention;

FIG. 9 is a block diagram illustrating a portion of the interconnect logic of FIG. 1 utilized in the request phase of an operation;

FIG. 10 is a more detailed block diagram of the local hub address launch buffer of FIG. 9;

FIG. 11 is a more detailed block diagram of the tag FIFO queues of FIG. 9;

FIGS. 12A and 12B are more detailed block diagrams of the local hub partial response FIFO queue and remote hub partial response FIFO queue of FIG. 9, respectively;

FIG. 13 is a time-space diagram illustrating the tenures of an operation with respect to the data structures depicted in FIG. 9;

FIGS. 14A-14D are flowcharts respectively depicting the request phase of an operation at a local master, local hub, remote hub, and remote leaf;

FIG. 14E is a high level logical flowchart of an exemplary method of generating a partial response at a snooper in accordance with the present invention;

FIG. 15 is a block diagram illustrating a portion of the interconnect logic of FIG. 1 utilized in the partial response phase of an operation;

FIGS. 16A-16D are flowcharts respectively depicting the partial response phase of an operation at a remote leaf, remote hub, local hub, and local master;

FIG. 17 is a block diagram illustrating a portion of the interconnect logic of FIG. 1 utilized in the combined response phase of an operation;

FIGS. 18A-18D are flowcharts respectively depicting the combined response phase of an operation at a local master, local hub, remote hub, and remote leaf;

FIG. 19 is a block diagram depicting a portion of the interconnect logic of FIG. 1 utilized in the data phase of an operation; and

FIGS. 20A-20C are flowcharts respectively depicting the data phase of an operation at the processing unit containing the data source, at a processing unit receiving data from another processing unit in its same processing node, and at a processing unit receiving data from a processing unit in another processing node.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT

I. Processing Unit and Data Processing System

With reference now to the figures and, in particular, with reference to FIG. 1, there is illustrated a high level block diagram of an exemplary embodiment of a processing unit 100 in accordance with the present invention. In the depicted embodiment, processing unit 100 is a single integrated circuit including two processor cores 102 a, 102 b for independently processing instructions and data. Each processor core 102 includes at least an instruction sequencing unit (ISU) 104 for fetching and ordering instructions for execution and one or more execution units 106 for executing instructions. The instructions executed by execution units 106 may include, for example, fixed and floating point arithmetic instructions, logical instructions, and instructions that request read and write access to a memory block.

The operation of each processor core 102 a, 102 b is supported by a multi-level volatile memory hierarchy having at its lowest level one or more shared system memories 132 (only one of which is shown in FIG. 1) and, at its upper levels, one or more levels of cache memory. As depicted, processing unit 100 includes an integrated memory controller (IMC) 124 that controls read and write access to a system memory 132 in response to requests received from processor cores 102 a, 102 b and operations snooped on an interconnect fabric (described below) by a snooper 126.

In the illustrative embodiment, the cache memory hierarchy of processing unit 100 includes a store-through level one (L1) cache 108 within each processor core 102 a, 102 b and a level two (L2) cache 110 shared by all processor cores 102 a, 102 b of the processing unit 100. L2 cache 110 includes an L2 array and directory 114, masters 112 and snoopers 116. Masters 112 initiate transactions on the interconnect fabric and access L2 array and directory 114 in response to memory access (and other) requests received from the associated processor cores 102 a, 102 b. Snoopers 116 detect operations on the interconnect fabric, provide appropriate responses, and perform any accesses to L2 array and directory 114 required by the operations. Although the illustrated cache hierarchy includes only two levels of cache, those skilled in the art will appreciate that alternative embodiments may include additional levels (L3, L4, etc.) of on-chip or off-chip in-line or lookaside cache, which may be fully inclusive, partially inclusive, or non-inclusive of the contents of the upper levels of cache.

As further shown in FIG. 1, processing unit 100 includes integrated interconnect logic 120 by which processing unit 100 may be coupled to the interconnect fabric as part of a larger data processing system. In the depicted embodiment, interconnect logic 120 supports an arbitrary number N of “first tier” interconnect links, which in this case include in-bound and out-bound X, Y and Z links. Interconnect logic 120 further supports two second tier links, designated in FIG. 1 as in-bound and out-bound A links. With these first and second tier links, each processing unit 100 may be coupled for bi-directional communication to up to N/2+1 (in this case, four) other processing units 100. Interconnect logic 120 includes request logic 121 a, partial response logic 121 b, combined response logic 121 c and data logic 121 d for processing and forwarding information during different phases of operations.

Each processing unit 100 further includes an instance of response logic 122, which implements a portion of a distributed coherency signaling mechanism that maintains cache coherency between the cache hierarchy of processing unit 100 and those of other processing units 100. Finally, each processing unit 100 includes an integrated I/O (input/output) controller 128 supporting the attachment of one or more I/O devices, such as I/O device 130. I/O controller 128 may issue operations and receive data on the X, Y, Z and A links in response to requests by I/O device 130.

Referring now to FIG. 2, there is depicted a block diagram of an exemplary embodiment of a data processing system 200 formed of multiple processing units 100 in accordance with the present invention. As shown, data processing system 200 includes four processing nodes 202 a-202 d, which in the depicted embodiment, are each realized as a multi-chip module (MCM) comprising a package containing four processing units 100. The processing units 100 within each processing node 202 are coupled for point-to-point communication by the processing units' X, Y, and Z links, as shown. Each processing unit 100 may be further coupled to a processing unit 100 in a different processing node 202 for point-to-point communication by the processing units' A links. Although illustrated in FIG. 2 with a double-headed arrow, it should be understood that each pair of X, Y, Z and A links are preferably (but not necessarily) implemented as two unidirectional links, rather than as a bi-directional link.

A general expression for forming the topology shown in FIG. 2 can be given as follows:

For all W≠V, Node[V].chip[W].link_A connects to Node[W].chip[V].link_A,

    where V and W belong to the set {a,b,c,d}

For each W=V, Node[V].chip[W].link_A may connect to either:

    1) nothing; or

    2) Node[extra].chip[W].link_A, in which case all links are fully utilized to construct a data processing system of five 8-way nodes.

Of course, alternative expressions can be defined to form other functionally equivalent topologies. Moreover, it should be appreciated that the depicted topology is representative but not exhaustive of data processing system topologies embodying the present invention and that other topologies are possible. In such alternative topologies, for example, the number of first tier links coupled to each processing unit 100 can be an arbitrary number N (meaning that the number of processing units 100 in each processing node 202 is N/2+1) and the number of processing nodes 202 need not equal the number of processing units 100 per processing node 202.
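By way of illustration only, the following C sketch enumerates the second tier (A link) connections implied by the above expression for a four-node system such as that of FIG. 2. The names used here (NUM_NODES, chip identifiers 'a' through 'd') are hypothetical and are not part of the depicted embodiment.

    #include <stdio.h>

    /* Illustrative sketch: enumerate the A-link connections implied by
     * "Node[V].chip[W].link_A connects to Node[W].chip[V].link_A" for W != V.
     */
    #define NUM_NODES 4

    int main(void)
    {
        const char id[NUM_NODES] = { 'a', 'b', 'c', 'd' };

        for (int v = 0; v < NUM_NODES; v++)
            for (int w = 0; w < NUM_NODES; w++) {
                if (v == w)
                    /* The W = V case: the A link may be left unconnected or
                     * used to attach an "extra" node, as noted above. */
                    printf("Node[%c].chip[%c].link_A -> nothing or extra node\n",
                           id[v], id[w]);
                else
                    printf("Node[%c].chip[%c].link_A <-> Node[%c].chip[%c].link_A\n",
                           id[v], id[w], id[w], id[v]);
            }
        return 0;
    }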

Those skilled in the art will appreciate that SMP data processing system 200 can include many additional unillustrated components, such as interconnect bridges, non-volatile storage, ports for connection to networks or attached devices, etc. Because such additional components are not necessary for an understanding of the present invention, they are not illustrated in FIG. 2 or discussed further herein.

II. Exemplary Operation

Referring now to FIG. 3, there is depicted a time-space diagram of an exemplary operation on the interconnect fabric of data processing system 200 of FIG. 2. The operation begins when a master 300 (e.g., a master 112 of an L2 cache 110 or a master within an I/O controller 128) issues a request 302 on the interconnect fabric. Request 302 preferably includes a transaction type indicating a type of desired access and a resource identifier (e.g., real address) indicating a resource to be accessed by the request. Common types of requests preferably include those set forth below in Table I.

TABLE I

READ: Requests a copy of the image of a memory block for query purposes.

RWITM (Read-With-Intent-To-Modify): Requests a unique copy of the image of a memory block with the intent to update (modify) it and requires destruction of other copies, if any.

DCLAIM (Data Claim): Requests authority to promote an existing query-only copy of memory block to a unique copy with the intent to update (modify) it and requires destruction of other copies, if any.

DCBZ (Data Cache Block Zero): Requests authority to create a new unique copy of a memory block without regard to its present state and subsequently modify its contents; requires destruction of other copies, if any.

CASTOUT: Copies the image of a memory block from a higher level of memory to a lower level of memory in preparation for the destruction of the higher level copy.

WRITE: Requests authority to create a new unique copy of a memory block without regard to its present state and immediately copy the image of the memory block from a higher level memory to a lower level memory in preparation for the destruction of the higher level copy.

PARTIAL WRITE: Requests authority to create a new unique copy of a partial memory block without regard to its present state and immediately copy the image of the partial memory block from a higher level memory to a lower level memory in preparation for the destruction of the higher level copy.

Further details regarding these operations and an exemplary cache coherency protocol that facilitates efficient handling of these operations may be found in copending U.S. patent application Ser. No. 10/______ (Docket No. AUS920040802US1), which is incorporated herein by reference.

Request 302 is received by snoopers 304, for example, snoopers 116 of L2 caches 110 and snoopers 126 of IMCs 124, distributed throughout data processing system 200. In general, with some exceptions, snoopers 116 in the same L2 cache 110 as the master 112 of request 302 do not snoop request 302 (i.e., there is generally no self-snooping) because a request 302 is transmitted on the interconnect fabric only if the request 302 cannot be serviced internally by a processing unit 100. Snoopers 304 that receive and process requests 302 each provide a respective partial response 306 representing the response of at least that snooper 304 to request 302. A snooper 126 within an IMC 124 determines the partial response 306 to provide based, for example, upon whether the snooper 126 is responsible for the request address and whether it has resources available to service the request. A snooper 116 of an L2 cache 110 may determine its partial response 306 based on, for example, the availability of its L2 cache directory 114, the availability of a snoop logic instance within snooper 116 to handle the request, and the coherency state associated with the request address in L2 cache directory 114.

The partial responses 306 of snoopers 304 are logically combined either in stages or all at once by one or more instances of response logic 122 to determine a system-wide combined response (CR) 310 to request 302. In one preferred embodiment, which will be assumed hereinafter, the instance of response logic 122 responsible for generating combined response 310 is located in the processing unit 100 containing the master 300 that issued request 302. Response logic 122 provides combined response 310 to master 300 and snoopers 304 via the interconnect fabric to indicate the system-wide response (e.g., success, failure, retry, etc.) to request 302. If the CR 310 indicates success of request 302, CR 310 may indicate, for example, a data source for a requested memory block, a cache state in which the requested memory block is to be cached by master 300, and whether “cleanup” operations invalidating the requested memory block in one or more L2 caches 110 are required.

In response to receipt of combined response 310, one or more of master 300 and snoopers 304 typically perform one or more operations in order to service request 302. These operations may include supplying data to master 300, invalidating or otherwise updating the coherency state of data cached in one or more L2 caches 110, performing castout operations, writing back data to a system memory 132, etc. If required by request 302, a requested or target memory block may be transmitted to or from master 300 before or after the generation of combined response 310 by response logic 122.

In the following description, the partial response 306 of a snooper 304 to a request 302 and the operations performed by the snooper 304 in response to the request 302 and/or its combined response 310 will be described with reference to whether that snooper is a Highest Point of Coherency (HPC), a Lowest Point of Coherency (LPC), or neither with respect to the request address specified by the request. An LPC is defined herein as a memory device or I/O device that serves as the repository for a memory block. In the absence of a HPC for the memory block, the LPC holds the true image of the memory block and has authority to grant or deny requests to generate an additional cached copy of the memory block. For a typical request in the data processing system embodiment of FIGS. 1 and 2, the LPC will be the memory controller 124 for the system memory 132 holding the referenced memory block. An HPC is defined herein as a uniquely identified device that caches a true image of the memory block (which may or may not be consistent with the corresponding memory block at the LPC) and has the authority to grant or deny a request to modify the memory block. Descriptively, the HPC may also provide a copy of the memory block to a requestor in response to an operation that does not modify the memory block. Thus, for a typical request in the data processing system embodiment of FIGS. 1 and 2, the HPC, if any, will be an L2 cache 110. Although other indicators may be utilized to designate an HPC for a memory block, a preferred embodiment of the present invention designates the HPC, if any, for a memory block utilizing selected cache coherency state(s) within the L2 cache directory 114 of an L2 cache 110.

Still referring to FIG. 3, the HPC, if any, for a memory block referenced in a request 302, or in the absence of an HPC, the LPC of the memory block, preferably has the responsibility of protecting the transfer of ownership of a memory block in response to a request 302 during a protection window 312 a. In the exemplary scenario shown in FIG. 3, a snooper 304 at the HPC (or in the absence of an HPC, the LPC) for the memory block specified by the request address of request 302 protects the transfer of ownership of the requested memory block to master 300 during a protection window 312 a that extends from the time that snooper 304 determines its partial response 306 until snooper 304 receives combined response 310. During protection window 312 a, snooper 304 protects the transfer of ownership by providing partial responses 306 to other requests specifying the same request address that prevent other masters from obtaining ownership (e.g., a retry partial response) until ownership has been successfully transferred to master 300. Master 300 likewise initiates a protection window 312 b to protect its ownership of the memory block requested in request 302 following receipt of combined response 310.

Because snoopers 304 all have limited resources for handling the CPU and I/O requests described above, several different levels of partial responses and corresponding CRs are possible. For example, if a snooper 126 within a memory controller 124 that is responsible for a requested memory block has a queue available to handle a request, the snooper 126 may respond with a partial response indicating that it is able to serve as the LPC for the request. If, on the other hand, the snooper 126 has no queue available to handle the request, the snooper 126 may respond with a partial response indicating that it is the LPC for the memory block, but is unable to currently service the request. Similarly, a snooper 116 in an L2 cache 110 may require an available instance of snoop logic and access to L2 cache directory 114 in order to handle a request. Absence of access to either (or both) of these resources results in a partial response (and corresponding CR) signaling an inability to service the request due to absence of a required resource.

III. Broadcast Flow of Exemplary Operation

Referring now to FIG. 4, which will be described in conjunction with FIGS. 5A-5C, there is illustrated a time-space diagram of an exemplary operation flow in data processing system 200 of FIG. 2. In these figures, the various processing units 100 within data processing system 200 are tagged with two locational identifiers—a first identifying the processing node 202 to which the processing unit 100 belongs and a second identifying the particular processing unit 100 within the processing node 202. Thus, for example, processing unit 100 ac refers to processing unit 100 c of processing node 202 a. In addition, each processing unit 100 is tagged with a functional identifier indicating its function relative to the other processing units 100 participating in the operation. These functional identifiers include: (1) local master (LM), which designates the processing unit 100 that originates the operation, (2) local hub (LH), which designates a processing unit 100 that is in the same processing node 202 as the local master and that is responsible for transmitting the operation to another processing node 202 (a local master can also be a local hub), (3) remote hub (RH), which designates a processing unit 100 that is in a different processing node 202 than the local master and that is responsible to distribute the operation to other processing units 100 in its processing node 202, and (4) remote leaf (RL), which designates a processing unit 100 that is in a different processing node 202 from the local master and that is not a remote hub.

As shown in FIG. 4, the exemplary operation has at least three phases as described above with reference to FIG. 3, namely, a request (or address) phase, a partial response (Presp) phase, and a combined response (Cresp) phase. These three phases preferably occur in the foregoing order and do not overlap. The operation may additionally have a data phase, which may optionally overlap with any of the request, partial response and combined response phases.

Still referring to FIG. 4 and referring additionally to FIG. 5A, the request phase begins when a local master 100 ac (i.e., processing unit 100 c of processing node 202 a) performs a synchronized broadcast of an operation, for example, a read operation, to each of the local hubs 100 aa, 100 ab, 100 ac and 100 ad within its processing node 202 a. It should be noted that the list of local hubs includes local hub 100 ac, which is also the local master. As described further below, this internal transmission is advantageously employed to synchronize the operation of local hub 100 ac with local hubs 100 aa, 100 ab and 100 ad so that the timing constraints discussed below can be more easily satisfied.

In response to receiving the operation, each local hub 100 that is coupled to a remote hub 100 by its A links transmits the operation to its remote hub 100. Thus, local hub 100 aa makes no further transmission of the operation, but local hubs 100 ab, 100 ac and 100 ad transmit the operation to remote hubs 100 ba, 100 ca and 100 da, respectively. Each remote hub 100 receiving the operation in turn transmits the operation to each remote leaf 100 in its processing node 202. Thus, remote hub 100 ba transmits the operation to remote leaves 100 bb, 100 bc and 100 bd, remote hub 100 ca transmits the operation to remote leaves 100 cb, 100 cc and 100 cd, and remote hub 100 da transmits the operation to remote leaves 100 db, 100 dc and 100 dd. In this manner, the operation is efficiently broadcast to all processing units 100 within data processing system 200 utilizing transmission over no more than three links.

Following the request phase, the partial response (Presp) phase occurs, as shown in FIGS. 4 and 5B. In the partial response phase, each remote leaf 100 evaluates the operation and provides its partial response to the operation to its respective remote hub 100. That is, remote leaves 100 bb, 100 bc and 100 bd transmit their respective partial responses to remote hub 100 ba, remote leaves 100 cb, 100 cc and 100 cd transmit their respective partial responses to remote hub 100 ca, and remote leaves 100 db, 100 dc and 100 dd transmit their respective partial responses to remote hub 100 da. Each of remote hubs 100 ba, 100 ca and 100 da in turn transmits these partial responses, as well as its own partial response, to a respective one of local hubs 100 ab, 100 ac and 100 ad. Local hubs 100 ab, 100 ac and 100 ad then forward these partial responses, as well as their own partial responses, to local master 100 ac. In addition, local hub 100 aa forwards its partial response to local master 100 ac, concluding the partial response phase.

As will be appreciated, the collection of partial responses in the manner shown can be implemented in a number of different ways. For example, it is possible to communicate an individual partial response back to the local master from each local hub, remote hub and remote leaf. Alternatively, for greater efficiency, it may be desirable to accumulate partial responses as they are communicated back to the local master. In order to ensure that the effect of each partial response is accurately communicated back to local master 100 ac, it is preferred that the partial responses be accumulated, if at all, in a non-destructive manner, for example, utilizing a logical OR function and an encoding in which no relevant information is lost when subjected to such a function (e.g., a “one-hot” encoding).
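Purely by way of illustration, the C sketch below shows why a one-hot encoding combined with a logical OR is non-destructive. The specific bit assignments are assumptions chosen only for the example; they are not taken from the depicted embodiment.

    #include <stdint.h>

    /* Hypothetical one-hot partial response encoding: each kind of response
     * occupies its own bit, so merging any number of responses with OR never
     * erases an earlier responder's vote. */
    enum presp_bits {
        PRESP_NULL    = 0x01,  /* snooper has no interest in the request     */
        PRESP_SHARED  = 0x02,  /* snooper holds a shared copy                */
        PRESP_RETRY   = 0x04,  /* snooper busy; request must be retried      */
        PRESP_LPC_ACK = 0x08,  /* memory controller accepts the request as LPC */
        PRESP_HPC_ACK = 0x10   /* cache owning coherency accepts the transfer  */
    };

    static inline uint8_t accumulate_presp(uint8_t acc, uint8_t incoming)
    {
        /* Logical OR: non-destructive accumulation for a one-hot encoding. */
        return acc | incoming;
    }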

As further shown in FIG. 4 and FIG. 5C, response logic 122 at local master 100 ac compiles the partial responses of the other processing units 100 to obtain a combined response representing the system-wide response to its operation. Local master 100 ac then broadcasts the combined response to all processing units 100 following the same paths of distribution as employed for the request phase. Thus, the combined response is first broadcast to each of the local hubs 100 aa, 100 ab, 100 ac and 100 ad within processing node 202 a. Again, for timing reasons, the local hubs 100 receiving the combined response include local hub 100 ac, which is also the local master. In response to receiving the combined response, local hub 100 aa makes no further transmission of the combined response, but local hubs 100 ab, 100 ac and 100 ad transmit the combined response to remote hubs 100 ba, 100 ca and 100 da, respectively. Each remote hub 100 receiving the combined response in turn transmits the combined response to each remote leaf 100 in its processing node 202. Thus, remote hub 100 ba transmits the combined response to remote leaves 100 bb, 100 bc and 100 bd, remote hub 100 ca transmits the combined response to remote leaves 100 cb, 100 cc and 100 cd, and remote hub 100 da transmits the combined response to remote leaves 100 db, 100 dc and 100 dd.

As noted above, servicing the operation may require an additional data phase, such as shown in FIGS. 5D and 5E. For example, as shown in FIG. 5D, if the operation is a read-type operation, such as a read or RWITM operation, remote leaf 100 dd may source the requested memory block to local master 100 ac via the links connecting remote leaf 100 dd to remote hub 100 da, remote hub 100 da to local hub 100 ad, and local hub 100 ad to local master 100 ac. Conversely, if the operation is a write-type operation, for example, a cache castout operation writing a modified memory block back to the system memory 132 of remote leaf 100 bb, the memory block is transmitted via the links connecting local master 100 ac to local hub 100 ab, local hub 100 ab to remote hub 100 ba, and remote hub 100 ba to remote leaf 100 bb, as shown in FIG. 5E.

Of course, the scenario depicted in FIG. 4 and FIGS. 5A-5E is merely exemplary of the myriad of possible operations that may occur concurrently in a multiprocessor data processing system such as data processing system 200.

IV. Timing Considerations

As described above with reference to FIG. 3, coherency is maintained during the “handoff” of coherency ownership of a memory block from a snooper 304 to a requesting master 300 in the possible presence of other masters competing for ownership of the same memory block through protection windows 312 a-312 b. For example, as shown in FIG. 6, protection window 312 a must be of sufficient duration to protect the transfer of coherency ownership of the requested memory block to winning master (WM) 300 in the presence of a competing request 322 by a competing master (CM) 320. To ensure that protection window 312 a has sufficient duration to protect the transfer of ownership of the requested memory block to winning master 300, the latency of communication between processing units 100 in accordance with FIG. 4 is preferably constrained such that the following conditions are met:

A_lat(CM_S) ≤ A_lat(CM_WM) + C_lat(WM_S),

where A_lat(CM_S) is the address latency of any competing master (CM) 320 to the snooper (S) 304 n owning coherence of the requested memory block, A_lat(CM_WM) is the address latency of any competing master (CM) 320 to the “winning” master (WM) 300 that is awarded ownership by snooper 304 n, and C_lat(WM_S) is the combined response latency from the time that the combined response is received by the winning master (WM) 300 to the time the combined response is received by the snooper (S) 304 n owning the requested memory block.
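For illustration, the constraint can be expressed as a simple check over the three latencies (expressed here in arbitrary cycle units; the function and parameter names are hypothetical):

    #include <stdbool.h>

    /* Returns true if A_lat(CM_S) <= A_lat(CM_WM) + C_lat(WM_S), i.e. the
     * protection windows 312 a and 312 b can together cover the handoff of
     * coherency ownership from the owning snooper to the winning master. */
    static bool timing_constraint_met(unsigned a_lat_cm_s,
                                      unsigned a_lat_cm_wm,
                                      unsigned c_lat_wm_s)
    {
        return a_lat_cm_s <= a_lat_cm_wm + c_lat_wm_s;
    }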

If the foregoing timing constraint, which is applicable to a system of arbitrary topology, is not satisfied, the request 322 of the competing master 320 may be received (1) by winning master 300 prior to winning master 300 assuming coherency ownership and initiating protection window 312 b and (2) by snooper 304 n after protection window 312 a ends. In such cases, neither winning master 300 nor snooper 304 n will provide a partial response to competing request 322 that prevents competing master 320 from assuming coherency ownership of the memory block and reading non-coherent data from memory.

Several observations may be made regarding the foregoing timing constraint. First, the address latency from the competing master 320 to the owning snooper 304 n has no necessary lower bound, but must have an upper bound. The upper bound is designed for by determining the worst case latency attainable given, among other things, the maximum possible oscillator drift, the longest links coupling processing units 100, the maximum number of accumulated stalls, and guaranteed worst case throughput. In order to ensure the upper bound is observed, the interconnect fabric must ensure non-blocking behavior.

Second, the address latency from the competing master 320 to the winning master 300 has no necessary upper bound, but must have a lower bound. The lower bound is determined by the best case latency attainable, given, among other things, the absence of stalls, the shortest possible link between processing units 100 and the slowest oscillator drift given a particular static configuration.

Although for a given operation, each of the winning master 300 and competing master 320 has only one timing bound for its respective request, it will be appreciated that during the course of operation any processing unit 100 may be a winning master for some operations and a competing (and losing) master for other operations. Consequently, each processing unit 100 effectively has an upper bound and a lower bound for its address latency.

Third, the combined response latency from the time that the combined response is generated to the time the combined response is observed by the winning master 300 has no necessary lower bound (the combined response may arrive at the winning master 300 at an arbitrarily early time), but must have an upper bound. By contrast, the combined response latency from the time that a combined response is generated until the combined response is received by the snooper 304 n has a lower bound, but no necessary upper bound (although one may be arbitrarily imposed to limit the number of operations concurrently in flight).

Fourth, there is no constraint on partial response latency. That is, because all of the terms of the timing constraint enumerated above pertain to request/address latency and combined response latency, the partial response latencies of snoopers 304 and competing master 320 to winning master 300 have no necessary upper or lower bounds.

V. Exemplary Link Information Allocation

The first tier and second tier links connecting processing units 100 may be implemented in a variety of ways to obtain the topology depicted in FIG. 2 and to meet the timing constraints illustrated in FIG. 6. In one preferred embodiment, each inbound and outbound first tier (X, Y and Z) link and each inbound and outbound second tier (A) link is implemented as a uni-directional 8-byte bus containing a number of different virtual channels or tenures to convey address, data, control and coherency information.

With reference now to FIGS. 7A-7B, there is illustrated a first exemplary time-sliced information allocation for the first tier X, Y and Z links and second tier A links. As shown, in this first embodiment information is allocated on the first and second tier links in a repeating 8 cycle frame in which the first 4 cycles comprise two address tenures transporting address, coherency and control information and the second 4 cycles are dedicated to a data tenure providing data transport.

Reference is first made to FIG. 7A, which illustrates the link information allocation for the first tier links. In each cycle in which the cycle number modulo 8 is 0, byte 0 communicates a transaction type 700 a (e.g., a read) of a first operation, bytes 1-5 provide the 5 lower address bytes 702 a 1 of the request address of the first operation, and bytes 6-7 form a reserved field 704. In the next cycle (i.e., the cycle for which cycle number modulo 8 is 1), bytes 0-1 communicate a master tag 706 a identifying the master 300 of the first operation (e.g., one of L2 cache masters 112 or a master within I/O controller 128), and byte 2 conveys the high address byte 702 a 2 of the request address of the first operation. Communicated together with this information pertaining to the first operation are up to three additional fields pertaining to different operations, namely, a local partial response 708 a intended for a local master in the same processing node 202 (bytes 3-4), a combined response 710 a in byte 5, and a remote partial response 712 a intended for a local master in a different processing node 202 (bytes 6-7). As noted above, these first two cycles form what is referred to herein as an address tenure.

As further illustrated in FIG. 7A, the next two cycles (i.e., the cycles for which the cycle number modulo 8 is 2 and 3) form a second address tenure having the same basic pattern as the first address tenure, with the exception that reserved field 704 is replaced with a data tag 714 and data token 715 forming a portion of the data tenure. Specifically, data tag 714 identifies the destination data sink to which the 32 bytes of data payload 716 a-716 d appearing in cycles 4-7 are directed. Its location within the address tenure immediately preceding the payload data advantageously permits the configuration of downstream steering in advance of receipt of the payload data, and hence, efficient data routing toward the specified data sink. Data token 715 provides an indication that a downstream queue entry has been freed and, consequently, that additional data may be transmitted on the paired X, Y, Z or A link without risk of overrun. Again it should be noted that transaction type 700 b, master tag 706 b, low address bytes 702 b 1, and high address byte 702 b 2 all pertain to a second operation, and data tag 714, local partial response 708 b, combined response 710 b and remote partial response 712 b all relate to one or more operations other than the second operation.
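A purely illustrative C model of the repeating 8 cycle, 8-byte-wide first tier frame described above is sketched below. Field names are hypothetical, the split of bytes 6-7 between data tag 714 and data token 715 is an assumption (the description above does not specify it), and the structure is intended only as a mnemonic summary of the field placement, not as a wire format.

    #include <stdint.h>

    /* Illustrative model of one first tier frame of FIG. 7A (8 cycles x 8 bytes). */
    typedef struct {
        /* cycle 0: first address tenure */
        uint8_t ttype_a;            /* byte 0: transaction type 700 a             */
        uint8_t addr_lo_a[5];       /* bytes 1-5: lower address bytes 702 a 1     */
        uint8_t reserved[2];        /* bytes 6-7: reserved field 704              */
        /* cycle 1 */
        uint8_t master_tag_a[2];    /* bytes 0-1: master tag 706 a                */
        uint8_t addr_hi_a;          /* byte 2: high address byte 702 a 2          */
        uint8_t local_presp_a[2];   /* bytes 3-4: local partial response 708 a    */
        uint8_t cresp_a;            /* byte 5: combined response 710 a            */
        uint8_t remote_presp_a[2];  /* bytes 6-7: remote partial response 712 a   */
        /* cycle 2: second address tenure (reserved field replaced)               */
        uint8_t ttype_b;            /* byte 0: transaction type 700 b             */
        uint8_t addr_lo_b[5];       /* bytes 1-5: lower address bytes 702 b 1     */
        uint8_t data_tag_token[2];  /* bytes 6-7: data tag 714 and data token 715 */
        /* cycle 3 */
        uint8_t master_tag_b[2];    /* bytes 0-1: master tag 706 b                */
        uint8_t addr_hi_b;          /* byte 2: high address byte 702 b 2          */
        uint8_t local_presp_b[2];   /* bytes 3-4: local partial response 708 b    */
        uint8_t cresp_b;            /* byte 5: combined response 710 b            */
        uint8_t remote_presp_b[2];  /* bytes 6-7: remote partial response 712 b   */
        /* cycles 4-7: data tenure */
        uint8_t data_payload[32];   /* data payload 716 a-716 d                   */
    } first_tier_frame_t;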

FIG. 7B depicts the link information allocation for the second tier A links. As can be seen by comparison with FIG. 7A, the link information allocation on the second tier A links is the same as that for the first tier links given in FIG. 7A, except that local partial response fields 708 a, 708 b are replaced with reserved fields 718 a, 718 b. This replacement is made for the simple reason that, as a second tier link, no local partial responses need to be communicated.

FIG. 7C illustrates an exemplary embodiment of a write request partial response 720, which may be transported within either a local partial response field 708 a, 708 b or a remote partial response field 712 a, 712 b in response to a write request. As shown, write request partial response 720 is two bytes in length and includes a 15-bit destination tag field 724 for specifying the tag of a snooper (e.g., an IMC snooper 126) that is the destination for write data and a 1-bit valid (V) flag 722 for indicating the validity of destination tag field 724.
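As a minimal sketch only, the two-byte write request partial response 720 can be packed and unpacked as shown below. The placement of the valid flag in the most significant bit is an assumption made for illustration; FIG. 7C as described above does not fix the bit positions.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative packing of write request partial response 720:
     * 1-bit valid (V) flag 722 plus 15-bit destination tag field 724. */
    static inline uint16_t pack_write_presp(bool valid, uint16_t dest_tag)
    {
        return (uint16_t)((valid ? 0x8000u : 0u) | (dest_tag & 0x7FFFu));
    }

    static inline bool write_presp_valid(uint16_t presp)
    {
        return (presp & 0x8000u) != 0;
    }

    static inline uint16_t write_presp_dest_tag(uint16_t presp)
    {
        return presp & 0x7FFFu;
    }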

Referring now to FIGS. 8A-8B, there is depicted a second exemplary cyclical information allocation for the first tier X, Y and Z links and second tier A links. As shown, in the second embodiment information is allocated on the first and second tier links in a repeating 6 cycle frame in which the first 2 cycles comprise an address frame containing address, coherency and control information and the second 4 cycles are dedicated to data transport. The tenures in the embodiment of FIGS. 8A-8B are identical to those depicted in cycles 2-7 of FIGS. 7A-7B and are accordingly not described further herein. For write requests, the partial responses communicated within local partial response field 808 and remote partial response field 812 may take the form of write request partial response 720 of FIG. 7C.

It will be appreciated by those skilled in the art that the embodiments of FIGS. 7A-7B and 8A-8B depict only two of a vast number of possible link information allocations. The selected link information allocation that is implemented can be made programmable, for example, through a hardware and/or software-settable mode bit in a configuration register 123 of FIG. 1. The selection of the link information allocation is typically based on one or more factors, such as the type of anticipated workload. For example, if scientific workloads predominate in data processing system 200, it is generally more preferable to allocate more bandwidth on the first and second tier links to data payload. Thus, the second embodiment shown in FIGS. 8A-8B will likely yield improved performance. Conversely, if commercial workloads predominate in data processing system 200, it is generally more preferable to allocate more bandwidth to address, coherency and control information, in which case the first embodiment shown in FIGS. 7A-7B would support higher performance. Although the determination of the type(s) of anticipated workload and the setting of configuration register 123 can be performed by a human operator, it is advantageous if the determination is made by hardware and/or software in an automated fashion. For example, in one embodiment, the determination of the type of workload can be made by service processor code executing on one or more of processing units 100 or on a dedicated auxiliary service processor (not illustrated).
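One purely illustrative way such an automated selection could be expressed is sketched below. The enumerated encodings, the 60 percent threshold and the sampling heuristic are assumptions invented for the example; the embodiment above specifies only that a mode bit of configuration register 123 selects among the allocations.

    /* Illustrative selection of a link information allocation via a mode bit
     * of configuration register 123. */
    enum link_alloc {
        LINK_ALLOC_ADDRESS_HEAVY = 0,  /* FIGS. 7A-7B: more address/coherency bandwidth */
        LINK_ALLOC_DATA_HEAVY    = 1   /* FIGS. 8A-8B: more data payload bandwidth      */
    };

    /* e.g. service processor code could periodically sample the fraction of
     * link bandwidth demanded by data payload and pick a setting accordingly. */
    static enum link_alloc choose_link_alloc(unsigned pct_data_bandwidth_demand)
    {
        return (pct_data_bandwidth_demand > 60) ? LINK_ALLOC_DATA_HEAVY
                                                : LINK_ALLOC_ADDRESS_HEAVY;
    }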

VI. Request Phase Structure and Operation

Referring now to FIG. 9, there is depicted a block diagram illustrating request logic 121 a within interconnect logic 120 of FIG. 1 utilized in request phase processing of an operation. As shown, the masters 300 of a processing unit (e.g., masters 112 within L2 cache 110 and masters within I/O controller 128), which may each initiate an operation, are each coupled to a master multiplexer 900. The output of master multiplexer 900 and the output of a hold buffer 902 that receives operations on the inbound second tier A link form inputs to a request multiplexer 904. The output of request multiplexer 904 drives a request bus 905 that is coupled to each of the outbound X, Y and Z links, a remote hub (RH) buffer 906, and the local hub (LH) address launch First-In, First-Out (FIFO) queue 910.

The inbound first tier (X, Y and Z) links are each coupled to the LH address launch buffer 910, as well as a respective one of remote leaf (RL) buffers 914 a-914 c. The outputs of remote hub buffer 906, LH address launch buffer 910, and RL buffers 914 a-914 c all form inputs of a snoop multiplexer 920. The output of snoop multiplexer 920 drives a snoop bus 922 to which a tag FIFO queue 924, the snoopers 304 (e.g., snoopers 116 of L2 cache 110 and snoopers 126 of IMC 124) of the processing unit 100, and the outbound A link are coupled. Snoopers 304 are further coupled to and supported by local hub (LH) partial response FIFO queues 930 and remote hub (RH) partial response FIFO queues 940.

Although other embodiments are possible, it is preferable if buffers 902, 906, and 914 a-914 c remain short in order to minimize communication latency. In one preferred embodiment, each of buffers 902, 906, and 914 a-914 c is sized to hold only the address tenure(s) of a single frame of the selected link information allocation.

With reference now to FIG. 10, there is illustrated a more detailed block diagram of local hub (LH) address launch buffer 910 of FIG. 9. As depicted, the local and inbound X, Y and Z link inputs of the LH address launch buffer 910 form inputs of a map logic 1010, which places requests received on each particular input into a respective corresponding position-dependent FIFO queue 1020 a-1020 d. In the depicted nomenclature, the processing unit 100 a in the upper left-hand corner of a processing node/MCM 202 is the “S” chip; the processing unit 100 b in the upper right-hand corner of the processing node/MCM 202 is the “T” chip; the processing unit 100 c in the lower left-hand corner of a processing node/MCM 202 is the “U” chip; and the processing unit 100 d in the lower right-hand corner of the processing node 202 is the “V” chip. Thus, for example, for local master/local hub 100 ac, requests received on the local input are placed by map logic 1010 in U FIFO queue 1020 c, and requests received on the inbound Y link are placed by map logic 1010 in S FIFO queue 1020 a. Map logic 1010 is employed to normalize input flows so that arbitration logic 1032, described below, in all local hubs 100 is synchronized to handle requests identically without employing any explicit inter-communication.

Although placed within position-dependent FIFO queues 1020 a-1020 d, requests are not immediately marked as valid and available for dispatch. Instead, the validation of requests in each of position-dependent FIFO queues 1020 a-1020 d is subject to a respective one of programmable delays 1000 a-1000 d in order to synchronize the requests that are received during each address tenure on the four inputs. Thus, the programmable delay 1000 a associated with the local input, which receives the request self-broadcast at the local master/local hub 100, is generally considerably longer than those associated with the other inputs. In order to ensure that the appropriate requests are validated, the validation signals generated by programmable delays 1000 a-1000 d are subject to the same mapping by map logic 1010 as the underlying requests.

The outputs of position-dependent FIFO queues 1020 a-1020 d form the inputs of local hub request multiplexer 1030, which selects one request from among position-dependent FIFO queues 1020 a-1020 d for presentation to snoop multiplexer 920 in response to a select signal generated by arbiter 1032. Arbiter 1032 implements a fair arbitration policy that is synchronized in its selections with the arbiters 1032 of all other local hubs 100 within a given processing node 202 so that the same request is broadcast on the outbound A links at the same time by all local hubs 100 in a processing node 202, as depicted in FIGS. 4 and 5A. Thus, given either of the exemplary link information allocations shown in FIGS. 7B and 8B, the output of local hub request multiplexer 1030 is timeslice-aligned to the address tenure(s) of an outbound A link request frame.

Because the input bandwidth of LH address launch buffer 910 is four times its output bandwidth, overruns of position-dependent FIFO queues 1020 a-1020 d are a design concern. In a preferred embodiment, queue overruns are prevented by implementing, for each position-dependent FIFO queue 1020, a pool of local hub tokens equal in size to the depth of the associated position-dependent FIFO queue 1020. A free local hub token is required for a local master to send a request to a local hub and guarantees that the local hub can queue the request. Thus, a local hub token is allocated when a request is issued by a local master 100 to a position-dependent FIFO queue 1020 in the local hub 100 and freed for reuse when arbiter 1032 issues an entry from the position-dependent FIFO queue 1020.
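A minimal sketch of this token discipline follows; the queue depth and the helper names are assumptions chosen only to illustrate the allocate-on-issue, free-on-dispatch behavior described above.

    #include <stdbool.h>

    /* Illustrative local hub token pool: one pool per position-dependent FIFO
     * queue 1020, sized to the queue depth, so that a local master may issue a
     * request only when the local hub is guaranteed to have room to queue it. */
    #define LH_QUEUE_DEPTH 8   /* assumed depth, for illustration only */

    typedef struct {
        unsigned free_tokens;
    } lh_token_pool_t;

    static void pool_init(lh_token_pool_t *p) { p->free_tokens = LH_QUEUE_DEPTH; }

    /* Local master: take a token before sending a request to this local hub. */
    static bool try_take_token(lh_token_pool_t *p)
    {
        if (p->free_tokens == 0)
            return false;      /* no token free: the request must be delayed */
        p->free_tokens--;
        return true;
    }

    /* Local hub: free the token when arbiter 1032 issues the queued entry. */
    static void free_token(lh_token_pool_t *p) { p->free_tokens++; }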

Referring now to FIG. 11, there is depicted a more detailed block diagram of tag FIFO queues 924 of FIG. 9. As shown, tag FIFO queues 924 include a local hub (LH) tag FIFO queue 924 a, a remote hub (RH) tag FIFO queue 924 b, and remote leaf (RL) tag FIFO queues 924 c-924 e. The master tag of a request is deposited in each of LH, RH and RL FIFO queues 924 a-924 e when the request is received at the processing unit(s) 100 serving in each of these given roles (LH, RH and RL) for that particular request. The master tag is retrieved from each of FIFO queues 924 when the combined response is received at the associated processing unit 100. Thus, rather than transporting the master tag with the combined response, master tags are retrieved by a processing unit 100 from its FIFO queue 924 as needed, resulting in bandwidth savings on the X, Y, Z and A links. Given that the order in which a combined response is received at the various processing units 100 is identical to the order in which the associated request was received, a FIFO policy for allocation and retrieval of the master tag can advantageously be employed.

LH tag FIFO queue 924 a includes a number of entries, each including a master tag field 1100 for storing the master tag of a request launched by arbiter 1032 and a position field 1102 for storing an indication of the position (i.e., S, T, U or V) of the local master 100 of the request. RH tag FIFO queue 924 b similarly includes multiple entries, each including at least a master tag field 1100 for storing the master tag of a request received via the inbound A link. RL tag FIFO queues 924 c-924 e are similarly constructed and respectively hold master tags of requests received by a remote leaf 100 via the inbound X, Y and Z links.
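The following sketch, with assumed depth and field widths, illustrates the push-on-request, pop-on-combined-response discipline that lets the master tag be recovered locally rather than transported with the combined response. It relies on the property noted above that combined responses are observed in the same order as the corresponding requests, and it assumes (as the sizing rule below guarantees) that the queue never overflows.

    #include <stdint.h>

    /* Illustrative tag FIFO in the spirit of queues 924 a-924 e. */
    #define TAG_FIFO_DEPTH 16

    typedef struct {
        uint16_t tag[TAG_FIFO_DEPTH];
        unsigned head, tail;
    } tag_fifo_t;

    /* Request phase: deposit the master tag when the request is received. */
    static void tag_fifo_push(tag_fifo_t *q, uint16_t master_tag)
    {
        q->tag[q->tail] = master_tag;
        q->tail = (q->tail + 1) % TAG_FIFO_DEPTH;
    }

    /* Combined response phase: retrieve the tag of the oldest outstanding request. */
    static uint16_t tag_fifo_pop(tag_fifo_t *q)
    {
        uint16_t t = q->tag[q->head];
        q->head = (q->head + 1) % TAG_FIFO_DEPTH;
        return t;
    }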

With reference now to FIGS. 12A and 12B, there are illustrated more detailed block diagrams of exemplary embodiments of the local hub (LH) partial response FIFO queue 930 and remote hub (RH) partial response FIFO queue 940 of FIG. 9. As indicated, LH partial response FIFO queue 930 includes a number of entries 1200 that each includes a partial response field 1202 for storing an accumulated partial response for a request and a response flag array 1204 having respective flags for each of the 5 possible sources from which the local hub 100 may receive a partial response (i.e., local (L), first tier X, Y, Z links, and second tier A link) at different times or possibly simultaneously. Entries 1200 within LH partial response FIFO queue 930 are allocated via an allocation pointer 1210 and deallocated via a deallocation pointer 1212. Various flags comprising response flag array 1204 are accessed utilizing A pointer 1214, X pointer 1216, Y pointer 1218, and Z pointer 1220.

As described further below, when a partial response for a particular request is received by partial response logic 121 b at a local hub 100, the partial response is accumulated within partial response field 1202, and the link from which the partial response was received is recorded by setting the corresponding flag within response flag array 1204. The corresponding one of pointers 1214, 1216, 1218 and 1220 is then advanced to the subsequent entry 1200. Of course, if a processing unit 100 is not fully connected, meaning that a partial response will not be received on one or more of its links, the corresponding flag(s) within response flag array 1204 are ignored. The determination of which links of the processing unit are connected, and thus which flags are active for each processing unit 100, can be made by reference to a configuration register (not illustrated).
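A sketch of the bookkeeping just described, with assumed field widths and names, is given below: each entry pairs an accumulated partial response with per-source flags, and the accumulated value is released only once every expected (i.e., connected) source has reported.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative model of an LH partial response FIFO entry 1200. */
    enum presp_src { SRC_L, SRC_X, SRC_Y, SRC_Z, SRC_A, SRC_COUNT };

    typedef struct {
        uint8_t accumulated_presp;      /* partial response field 1202             */
        bool    received[SRC_COUNT];    /* response flag array 1204                */
        bool    expected[SRC_COUNT];    /* false for links that are not connected  */
    } lh_presp_entry_t;

    static void record_presp(lh_presp_entry_t *e, enum presp_src src, uint8_t presp)
    {
        e->accumulated_presp |= presp;  /* non-destructive accumulation            */
        e->received[src] = true;        /* record the link that has reported       */
    }

    /* True once every connected source has supplied its partial response. */
    static bool all_presps_in(const lh_presp_entry_t *e)
    {
        for (int s = 0; s < SRC_COUNT; s++)
            if (e->expected[s] && !e->received[s])
                return false;
        return true;
    }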

As can be seen by comparison of FIG. 12B and FIG. 12A, RH partial response FIFO queue 940 is constructed similarly to LH partial response FIFO queue 930. RH partial response FIFO queue 940 includes a number of entries 1230 that each includes a partial response field 1202 for storing an accumulated partial response and a response flag array 1234 having respective flags for each of the 4 possible sources from which the remote hub may receive a partial response (i.e., remote (R), and first tier X, Y, and Z links). Entries 1230 within RH partial response FIFO queue 940 are allocated via an allocation pointer 1210 and deallocated via a deallocation pointer 1212. Various flags comprising response flag array 1234 are accessed and updated utilizing X pointer 1216, Y pointer 1218, and Z pointer 1220. Of course, if a processing unit 100 is not fully connected, meaning that a partial response will not be received on one or more of its links, the corresponding flag(s) within response flag array 1234 are ignored. The determination of which links of the processing unit are connected, and thus which flags are active for each processing unit 100, can be made by reference to a configuration register (not illustrated).

Referring now to FIG. 13, there is depicted a time-space diagram illustrating the tenure of an exemplary operation with respect to the exemplary data structures depicted in FIG. 9 through FIG. 12B. As shown at the top of FIG. 13 and as described previously with reference to FIG. 4, the operation is issued by local master 100 ac to each local hub 100, including local hub 100 ab. Local hub 100 ab forwards the operation to remote hub 100 ba, which in turn forwards the operation to its remote leaves, including remote leaf 100 bd. The partial responses to the operation traverse the same series of links in reverse order back to local master 100 ac, which broadcasts the combined response to each processing unit 100, including local hub 100 ab, remote hub 100 ba, and remote leaf 100 bd. As dictated by the timing constraints described above, the time from the initiation of the operation by local master 100 ac to its launch by the local hubs 100 aa, 100 ab, 100 ac and 100 ad is a variable time, the time from the launch of the operation by local hubs 100 to its receipt by the remote leaves 100 is a bounded time, the partial response latency from the remote leaves 100 to the local master 100 is a variable time, and the combined response latency from the local master 100 to the remote leaves 100 is a bounded time.

Against the backdrop of this timing sequence, FIG. 13 illustrates the tenures of various items of information within various data structures within data processing system 200 during the request phase, partial response phase, and combined response phase of an operation. In particular, the tenure of a request in a LH launch buffer 910 (and hence the tenure of a local hub token) is depicted at reference numeral 1300, the tenure of an entry in LH tag FIFO queue 924 a is depicted at reference numeral 1302, the tenure of an entry 1200 in LH partial response FIFO queue 930 is depicted at block 1304, the tenure of an entry in a RH tag FIFO 924 b is depicted at reference numeral 1306, the tenure of an entry 1230 in a RH partial response FIFO queue 940 is depicted at reference numeral 1308, and the tenure of an entry in the RL tag FIFO queues 924 c, 924 d and 924 e is depicted at reference numeral 1310. FIG. 13 further illustrates the duration of a protection window 1312 (also 312 a of FIGS. 3 and 6) extended by the snooper within remote leaf 100 dd to protect the transfer of ownership of the memory block to local master 100 ac from generation of its partial response until receipt of the combined response (reference numeral 1312). As shown at reference numeral 1314 (and also at 312 b of FIGS. 3 and 6), local master 100 ac also protects the transfer of ownership from receipt of the combined response.

As indicated at reference numerals 1302, 1306 and 1310, the entries in the LH tag FIFO queue 924 a, RH tag FIFO queue 924 b and RL tag FIFO queues 924 c-924 e are subject to the longest tenures. Consequently, the minimum depth of tag FIFO queues 924 (which are generally designed to be the same) limits the maximum number of requests that can be in flight in the data processing system at any one time. In general, the desired depth of tag FIFO queues 924 can be selected by dividing the expected maximum latency from snooping of a request by an arbitrarily selected processing unit 100 to receipt of the combined response by that processing unit 100 by the maximum number of requests that can be issued given the selected link information allocation. Although the other queues (e.g., LH partial response FIFO queue 930 and RH partial response FIFO queue 940) may safely be assigned shorter queue depths given the shorter tenure of their entries, for simplicity it is desirable in at least some embodiments to set the depths of these queues to be the same as tag FIFO queues 924 a-924 e.
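
Purely to illustrate how such an in-flight bound translates into a queue depth, the short program below counts the requests that can be outstanding during an assumed maximum request-to-combined-response latency. Every numeric value is an invented placeholder, and the calculation is one possible reading of the sizing rule rather than the exact formula used by the design.

```c
#include <stdio.h>

int main(void)
{
    /* All values below are illustrative assumptions. */
    unsigned max_latency_cycles = 2000; /* snoop of request -> combined response */
    unsigned cycles_per_frame   = 8;    /* length of one link information frame  */
    unsigned requests_per_frame = 2;    /* address tenures per frame             */

    /* One tag FIFO entry is needed for every request that can be in
     * flight during the maximum latency.                              */
    unsigned frames_in_flight = (max_latency_cycles + cycles_per_frame - 1)
                                / cycles_per_frame;
    unsigned tag_fifo_depth   = frames_in_flight * requests_per_frame;

    printf("tag FIFO depth >= %u entries\n", tag_fifo_depth);
    return 0;
}
```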

With reference now to FIGS. 14A-14D, flowcharts are given that respectively depict exemplary processing of an operation during the request phase at a local master, local hub, remote hub, and remote leaf in accordance with an exemplary embodiment of the present invention. Referring now specifically to FIG. 14A, request phase processing at the local master 100 begins at block 1400 with the generation of a request by a particular master 300 (e.g., one of masters 112 within an L2 cache 110 or a master within an I/O controller 128) within a local master 100. Following block 1400, the process proceeds to blocks 1402, 1404, 1406, and 1408, each of which represents a condition on the issuance of the request by the particular master 300. The conditions illustrated at blocks 1402 and 1404 represent the operation of master multiplexer 900, and the conditions illustrated at blocks 1406 and 1408 represent the operation of request multiplexer 904.

Turning first to blocks 1402 and 1404, master multiplexer 900 outputs the request of the particular master 300 if the fair arbitration policy governing master multiplexer 900 selects the request of the particular master 300 from the requests of (possibly) multiple competing masters 300 (block 1402) and if a local hub token is available for assignment to the request (block 1404).

Assuming that the request of the particular master 300 progresses through master multiplexer 900 to request multiplexer 904, request multiplexer 904 issues the request on request bus 905 only if an address tenure is then available for a request in the outbound first tier link information allocation (block 1406). That is, the output of request multiplexer 904 is timeslice aligned with the selected link information allocation and will only generate an output during cycles designed to carry a request (e.g., cycle 0 or 2 of the embodiment of FIG. 7A or cycle 0 of the embodiment of FIG. 8A). As further illustrated at block 1408, request multiplexer 904 will only issue a request if no request is inbound on the second tier A link, which is always given priority. Thus, the second tier links are guaranteed to be non-blocking with respect to inbound requests. Even with such a non-blocking policy, requests by masters 300 can be prevented from “starving” through implementation of an appropriate policy in the arbiter 1032 of the upstream hub that prevents “brickwalling” of requests during numerous consecutive address tenures on the inbound A link of the downstream hub.
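
The four issue conditions of blocks 1402-1408 can be read as a single gate evaluated each cycle. The sketch below is a hedged paraphrase in C; the Boolean inputs stand in for fairness, token, tenure, and inbound-link state that the hardware computes elsewhere.

```c
#include <stdbool.h>

/* Conditions on issuing a request at the local master (blocks 1402-1408). */
bool can_issue_request(bool fair_arb_selected_this_master, /* block 1402 */
                       bool local_hub_token_available,     /* block 1404 */
                       bool outbound_address_tenure_now,   /* block 1406 */
                       bool request_inbound_on_a_link)     /* block 1408 */
{
    /* The inbound second tier A link is always given priority, so an
     * inbound request blocks local issuance for this address tenure.  */
    return fair_arb_selected_this_master &&
           local_hub_token_available &&
           outbound_address_tenure_now &&
           !request_inbound_on_a_link;
}
```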

If a negative determination is made at any of blocks 1402-1408, therequest is delayed, as indicated at block 1410, until a subsequent cycleduring which all of the determinations illustrated at blocks 1402-1408are positive. If, on the other hand, positive determinations are made atall of blocks 1402-1408, the process proceeds to block 1412, beginningtenure 1300 of FIG. 13. Block 1412 depicts request multiplexer 904broadcasting the request on request bus 905 to each of the outbound X, Yand Z links and to the local hub address launch buffer 910. Thereafter,the process bifurcates and passes through page connectors 1414 and 1416to FIG. 14B, which illustrates the processing of the request at each ofthe local hubs 100.

With reference now to FIG. 14B, processing of the request at the localhub 100 that is also the local master 100 is illustrated beginning atblock 1416, and processing of the request at each of the other localhubs 100 in the same processing node 202 as the local master 100 isdepicted beginning at block 1414. Turning first to block 1414, requestsreceived by a local hub 100 on the inbound X, Y and Z links are receivedby LH address launch buffer 910. As depicted at block 1420 and in FIG.10, map logic 1010 maps each of the X, Y and Z requests to theappropriate ones of position-dependent FIFO queues 1020 a-1020 d forbuffering. As noted above, requests received on the X, Y and Z links andplaced within position-dependent queues 1020 a-1020 d are notimmediately validated. Instead, the requests are subject to respectiveones of tuning delays 1000 a-1000 d, which synchronize the handling ofthe X, Y and Z requests and the local request on a given local hub 100with the handling of the corresponding requests at the other local hubs100 in the same processing node 202 (block 1422). Thereafter, as shownat block 1430, the tuning delays 1000 validate their respective requestswithin position-dependent FIFO queues 1020 a-1020 d.

Referring now to block 1416, at the local master/local hub 100, the request on request bus 905 is fed directly into LH address launch buffer 910. Because no inter-chip link is traversed, this local request arrives at LH address launch FIFO 910 earlier than requests issued in the same cycle arrive on the inbound X, Y and Z links. Accordingly, following the mapping by map logic 1010, which is illustrated at block 1424, one of tuning delays 1000 a-1000 d applies a longer delay to the local request to synchronize its validation with the validation of requests received on the inbound X, Y and Z links (block 1426). Following this delay interval, the relevant tuning delay 1000 validates the local request, as shown at block 1430.

Following the validation of the requests queued within LH address launch buffer 910 at block 1430, the process then proceeds to blocks 1434-1440, each of which represents a condition on the issuance of a request from LH address launch buffer 910 enforced by arbiter 1032. As noted above, the arbiters 1032 within all processing units 100 are synchronized so that the same decision is made by all local hubs 100 without inter-communication. As depicted at block 1434, an arbiter 1032 permits local hub request multiplexer 1030 to output a request only if an address tenure is then available for the request in the outbound second tier link information allocation. Thus, for example, arbiter 1032 causes local hub request multiplexer 1030 to initiate transmission of requests only during cycle 0 or 2 of the embodiment of FIG. 7B or cycle 0 of the embodiment of FIG. 8B. In addition, a request is output by local hub request multiplexer 1030 only if the fair arbitration policy implemented by arbiter 1032 determines that the request belongs to the position-dependent FIFO queue 1020 a-1020 d that should be serviced next (block 1436). As depicted further at block 1438, arbiter 1032 causes local hub request multiplexer 1030 to output a request only if it determines that it has not been outputting too many requests in successive address tenures, possibly “starving” the masters 300 in the processing unit 100 coupled to its outbound A link. Finally, arbiter 1032 permits a request to be output by local hub request multiplexer 1030 only if an entry is available for allocation in LH tag FIFO queue 924 a (block 1440).
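
A comparable sketch of the conditions enforced by arbiter 1032 at blocks 1434-1440 follows. The rotation over position-dependent queues, the consecutive-tenure limit used for the anti-“brickwalling” check, and all identifier names are assumptions made only for illustration.

```c
#include <stdbool.h>

#define MAX_CONSECUTIVE_TENURES 4     /* assumed anti-starvation limit */

struct lh_arbiter {
    unsigned next_queue;              /* fair rotation over FIFO queues 1020a-1020d */
    unsigned consecutive_tenures;     /* address tenures used in a row              */
};

bool arbiter_permits_issue(struct lh_arbiter *arb,
                           bool second_tier_address_tenure,  /* block 1434 */
                           unsigned requesting_queue,        /* block 1436 */
                           bool lh_tag_fifo_entry_free)      /* block 1440 */
{
    bool fair_turn   = (requesting_queue == arb->next_queue);
    bool not_hogging = (arb->consecutive_tenures < MAX_CONSECUTIVE_TENURES); /* 1438 */

    if (!(second_tier_address_tenure && fair_turn &&
          not_hogging && lh_tag_fifo_entry_free))
        return false;

    arb->next_queue = (arb->next_queue + 1) % 4;
    arb->consecutive_tenures++;
    return true;
}

/* Called when an address tenure passes without an issue, letting the
 * downstream masters make progress again.                             */
void arbiter_tenure_skipped(struct lh_arbiter *arb)
{
    arb->consecutive_tenures = 0;
}
```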

If a negative determination is made at any of blocks 1434-1440, therequest is delayed, as indicated at block 1442, until a subsequent cycleduring which all of the determinations illustrated at blocks 1434-1440are positive. If, on the other hand, positive determinations are made atall of blocks 1434-1440, arbiter 1032 signals local hub requestmultiplexer 1030 to output the selected request to an input ofmultiplexer 920, which always gives priority to a request, if any,presented by LH address launch buffer 910. Thus, multiplexer 920 issuesthe request on snoop bus 922. It should be noted that the other ports ofmultiplexer 920 (e.g., RH, RLX, RLY, and RLZ) could present requestsconcurrently with LH address launch buffer 910, meaning that the maximumbandwidth of snoop bus 922 must equal 10/8 (assuming the embodiment ofFIG. 7B) or 5/6 (assuming the embodiment of FIG. 8B) of the bandwidth ofthe outbound A link in order to keep up with maximum arrival rate.

It should also be observed that only requests buffered within local hubaddress launch buffer 910 are transmitted on the outbound A link and arerequired to be aligned with address tenures within the link informationallocation. Because all other requests competing for issuance bymultiplexer 920 target only the local snoopers 304 and their respectiveFIFO queues rather than the outbound A link, such requests may be issuedin the remaining cycles of the information frames. Consequently,regardless of the particular arbitration scheme employed by multiplexer920, all requests concurrently presented to multiplexer 920 areguaranteed to be transmitted within the latency of a single informationframe.

As indicated at block 1444, in response to the issuance of the request on snoop bus 922, LH tag FIFO queue 924 a records the master tag specified in the request and the link (local, X, Y or Z) on which the request was received from the local master 100 in the master tag field 1100 and source field 1102, respectively, of an entry, beginning tenure 1302. The request is then routed to the outbound A link, as shown at block 1446. The process then passes through page connector 1448 to FIG. 14C, which depicts the processing of the request at the remote hub during the request phase.

The process depicted in FIG. 14B also proceeds from block 1446 to block 1450, which illustrates local hub 100 freeing the local hub token allocated to the request in response to the removal of the request from LH address launch buffer 910, ending tenure 1300. The request is further routed to the snoopers 304 in the local hub 100, as shown at block 1452. In response to receipt of the request, snoopers 304 generate a partial response (block 1454), which is recorded within LH partial response FIFO queue 930, beginning tenure 1304 (block 1456). In particular, at block 1456, an entry 1200 in the LH partial response FIFO queue 930 is allocated to the request by reference to allocation pointer 1210, allocation pointer 1210 is incremented, the partial response of the local hub is placed within the partial response field 1202 of the allocated entry, and the local (L) flag is set in the response flag array 1204. Thereafter, request phase processing at the local hub ends at block 1458.
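
The bookkeeping of block 1456 can be summarized in a few lines of C. The entry layout, queue depth, and flag names below mirror the earlier remote hub sketch and are likewise assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define LH_PRESP_QUEUE_DEPTH 16       /* assumed depth */

/* Sources a local hub may hear from: local (L), first tier X, Y and Z,
 * and the second tier A link.                                          */
enum lh_presp_flag { LH_FLAG_L, LH_FLAG_X, LH_FLAG_Y, LH_FLAG_Z, LH_FLAG_A,
                     LH_NUM_FLAGS };

struct lh_presp_entry {
    uint32_t partial_response;        /* partial response field 1202 */
    bool     flag[LH_NUM_FLAGS];      /* response flag array 1204    */
};

struct lh_presp_fifo {
    struct lh_presp_entry entry[LH_PRESP_QUEUE_DEPTH];
    unsigned alloc_ptr;               /* allocation pointer 1210     */
};

/* Block 1456: allocate the next entry, record the local hub's own
 * partial response, and set the local (L) flag.                     */
void lh_record_local_presp(struct lh_presp_fifo *q, uint32_t local_presp)
{
    struct lh_presp_entry *e = &q->entry[q->alloc_ptr];
    q->alloc_ptr = (q->alloc_ptr + 1) % LH_PRESP_QUEUE_DEPTH;

    for (int i = 0; i < LH_NUM_FLAGS; i++)
        e->flag[i] = false;
    e->partial_response = local_presp;
    e->flag[LH_FLAG_L] = true;
}
```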

Referring now to FIG. 14C, there is depicted a high level logicalflowchart of an exemplary method of request processing at a remote hubin accordance with the present invention. As depicted, the processbegins at page connector 1448 upon receipt of the request at the remotehub 100 on its inbound A link. As noted above, the inbound A link isnon-blocking, meaning that after the request is latched into a holdbuffer 902 as shown at block 1460, the request is transmitted by requestmultiplexer 904 as soon as the next address tenure is available in thefirst tier link information allocation. Thus, as shown at block 1464, ifno address tenure is available, multiplexer 904 waits for the nextinformation frame as shown at block 1466. If an address tenure isavailable, the process proceeds from block 1464 to block 1468, whichillustrates multiplexer 904 broadcasting the request on request bus 905to the outbound X, Y and Z links and RH hold buffer 906.

Following block 1468, the process bifurcates. A first path passes through page connector 1470 to FIG. 14D, which illustrates an exemplary method of request processing at the remote leaves 100. The second path from block 1468 proceeds to block 1474, which illustrates the snoop multiplexer 920 determining which of the requests presented at its inputs to output on snoop bus 922. As indicated, snoop multiplexer 920 prioritizes local hub requests over remote hub requests, which are in turn prioritized over requests buffered in remote leaf buffers 914 a-914 c. Thus, if a local hub request is presented for selection by LH address launch buffer 910, the request buffered within remote hub buffer 906 is delayed, as shown at block 1476. If, however, no request is presented by LH address launch buffer 910, snoop multiplexer 920 issues the request from remote hub buffer 906 on snoop bus 922.
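
The fixed priority applied by snoop multiplexer 920 can be encoded as below; the enumeration, the round-robin hint for the remote leaf inputs, and the function shape are assumptions made for the sake of a concrete example.

```c
#include <stdbool.h>

enum snoop_source { SRC_NONE, SRC_LH, SRC_RH, SRC_RLX, SRC_RLY, SRC_RLZ };

/* Blocks 1474-1476 and 1491-1492: local hub requests beat remote hub
 * requests, which beat requests held in remote leaf buffers 914a-914c. */
enum snoop_source snoop_mux_select(bool lh_valid, bool rh_valid,
                                   bool rlx_valid, bool rly_valid,
                                   bool rlz_valid, unsigned rl_round_robin)
{
    if (lh_valid) return SRC_LH;
    if (rh_valid) return SRC_RH;

    /* Fair choice among the X, Y and Z hold buffers. */
    for (unsigned i = 0; i < 3; i++) {
        unsigned pick = (rl_round_robin + i) % 3;
        if (pick == 0 && rlx_valid) return SRC_RLX;
        if (pick == 1 && rly_valid) return SRC_RLY;
        if (pick == 2 && rlz_valid) return SRC_RLZ;
    }
    return SRC_NONE;
}
```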

In response to detecting the request on snoop bus 922, RH tag FIFO 924 b places the master tag specified by the request into master tag field 1100 of its next available entry, beginning tenure 1306 (block 1478). The request is further routed to the snoopers 304 in the remote hub 100, as shown at block 1480. In response to receipt of the request, snoopers 304 generate a partial response at block 1482, which is recorded within RH partial response FIFO queue 940, beginning tenure 1308 (block 1484). In particular, an entry 1230 in the RH partial response FIFO queue 940 is allocated to the request by reference to its allocation pointer 1210, the allocation pointer 1210 is incremented, the partial response of the remote hub is placed within the partial response field 1202, and the remote flag (R) is set in the response flag array 1234. Thereafter, request phase processing at the remote hub ends at block 1486.

With reference now to FIG. 14D, there is illustrated a high levellogical flowchart of an exemplary method of request processing at aremote leaf 100 in accordance with the present invention. As shown, theprocess begins at page connector 1470 upon receipt of the request at theremote leaf 100 on one of its inbound X, Y and Z links. As indicated atblock 1490, in response to receipt of the request, the request islatched into one of RL hold buffers 914 a-914 c. Next, as depicted atblock 1491, the request is evaluated by snoop multiplexer 920 togetherwith the other requests presented to its inputs. As discussed above,snoop multiplexer 920 prioritizes local hub requests over remote hubrequests, which are in turn prioritized over requests buffered in remoteleaf buffers 914 a-914 c. Thus, if a local hub or remote hub request ispresented for selection, the request buffered within the RL hold buffer914 is delayed, as shown at block 1492. If, however, no higher priorityrequest is presented to snoop multiplexer 920, snoop multiplexer 920issues the request from the RL hold buffer 914 on snoop bus 922, fairlychoosing between X, Y and Z requests.

In response to detecting the request on snoop bus 922, the RL tag FIFO 924 c-924 e associated with the inbound first tier link on which the request was received places the master tag specified by the request into the master tag field 1100 of its next available entry, beginning tenure 1310 (block 1493). The request is further routed to the snoopers 304 in the remote leaf 100, as shown at block 1494. In response to receipt of the request, the snoopers 304 process the request, generate their respective partial responses, and accumulate the partial responses to obtain the partial response of that processing unit 100 (block 1495). As indicated by page connector 1497, the partial response of the snoopers 304 of the remote leaf 100 is handled in accordance with FIG. 16A, which is described below.

FIG. 14E is a high level logical flowchart of an exemplary method by which snoopers 304 generate partial responses for requests, for example, at blocks 1454, 1482 and 1495 of FIGS. 14B-14D. The process begins at block 1401 in response to receipt by a snooper 304 (e.g., an IMC snooper 126, L2 cache snooper 116 or a snooper within an I/O controller 128) of a request. In response to receipt of the request, the snooper 304 determines by reference to the transaction type specified by the request whether or not the request is a write-type request, such as a castout request, write request, or partial write request. In response to the snooper 304 determining at block 1403 that the request is not a write-type request (e.g., a read or RWITM request), the process proceeds to block 1405, which illustrates the snooper 304 generating the partial response for the request, if required, by conventional processing. If, however, the snooper 304 determines that the request is a write-type request, the process proceeds to block 1407.

Block 1407 depicts the snooper 304 determining whether or not it is the LPC for the request address specified by the write-type request. For example, snooper 304 may make the illustrated determination by reference to one or more base address registers (BARs) and/or address hash functions specifying address range(s) for which the snooper 304 is responsible (i.e., the LPC). If snooper 304 determines that it is not the LPC for the request address, the process passes to block 1409. Block 1409 illustrates snooper 304 generating a write request partial response 720 (FIG. 7C) in which the valid field 722 and the destination tag field 724 are formed of all ‘0’s, thereby signifying that the snooper 304 is not the LPC for the request address. If, however, snooper 304 determines at block 1407 that it is the LPC for the request address, the process passes to block 1411, which depicts snooper 304 generating a write request partial response 720 in which valid field 722 is set to ‘1’ and destination tag field 724 specifies a destination tag or route that uniquely identifies the location of snooper 304 within data processing system 200. Following either of blocks 1409 or 1411, the process shown in FIG. 14E ends at block 1413.
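
A hedged C sketch of the decision at blocks 1403-1411 follows. The single-range BAR test, the tag width, and the name my_destination_tag are placeholders standing in for whatever mechanism actually locates the LPC and identifies the snooper.

```c
#include <stdbool.h>
#include <stdint.h>

struct write_presp {           /* write request partial response 720 */
    uint8_t  valid;            /* valid field 722                    */
    uint32_t destination_tag;  /* destination tag field 724          */
};

/* Hypothetical BAR check: is this snooper the LPC for the address? */
static bool i_am_lpc(uint64_t addr, uint64_t bar_base, uint64_t bar_size)
{
    return addr >= bar_base && addr < bar_base + bar_size;
}

struct write_presp snooper_write_presp(uint64_t request_addr,
                                       uint64_t bar_base, uint64_t bar_size,
                                       uint32_t my_destination_tag)
{
    struct write_presp pr = { 0, 0 };  /* all zeros: not the LPC (block 1409) */

    if (i_am_lpc(request_addr, bar_base, bar_size)) {
        pr.valid = 1;                            /* block 1411 */
        pr.destination_tag = my_destination_tag; /* uniquely locates the snooper */
    }
    return pr;
}
```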

Referring now to FIG. 15, there is depicted a block diagram illustratingan exemplary embodiment of the partial response logic 121 b withininterconnect logic 120 of FIG. 1. As shown, partial response logic 121 bincludes route logic 1500 that routes a remote partial responsegenerated by the snoopers 304 at a remote leaf 100 back to the remotehub 100 from which the request was received via the appropriate one ofoutbound first tier X, Y and Z links. In addition, partial responselogic 121 b includes combining logic 1502 and route logic 1504, whichrespectively combine partial responses received from remote leaves 100and route such partial responses from RH partial response FIFO queue 940to the local hub 100 via the outbound A link. Partial response logic 121b further includes combining logic 1506 and route logic 1508, whichrespectively combine partial responses received from remote hubs 100 androute such partial responses from LH partial response FIFO queue 930 tothe local master 100 via the appropriate one of the X, Y and Z links.Finally, partial response logic 121 b includes combining logic 1510,which accumulates partial responses received from local hubs 100 andpasses the accumulated partial response to response logic 122 forgeneration of the combined response for the request.

With reference now to FIG. 16A-16D, there are illustrated flowchartsrespectively depicting exemplary processing during the partial responsephase of an operation at a remote leaf, remote hub, local hub, and localmaster. In these figures, transmission of partial responses may besubject to various delays that are not explicitly illustrated. However,because there is no timing constraint on partial response latency asdiscussed above, such delays, if present, will not induce errors inoperation and are accordingly not described further herein.

Referring now specifically to FIG. 16A, partial response phaseprocessing at the remote leaf 100 begins at block 1600 when the snoopers304 of the remote leaf 100 generate partial responses for the request.As shown at block 1602, route logic 1500 then routes, using the remotepartial response field 712 or 812 of the link information allocation,the partial response to the remote hub 100 for the request via theoutbound X, Y or Z link corresponding to the inbound first tier link onwhich the request was received. As indicated above, the inbound firsttier link on which the request was received is indicated by which one ofRL tag FIFO queue 924 c-924 e holds the master tag for the request.Thereafter, partial response processing continues at the remote hub 100,as indicated by page connector 1604 and as described below withreference to FIG. 16B.

With reference now to FIG. 16B, there is illustrated a high level logical flowchart of an exemplary embodiment of a method of partial response processing at a remote hub in accordance with the present invention. The illustrated process begins at page connector 1604 in response to receipt of the partial response of one of the remote leaves 100 coupled to the remote hub 100 by one of the first tier X, Y and Z links. In response to receipt of the partial response, combining logic 1502 reads out the entry 1230 within RH partial response FIFO queue 940 allocated to the operation. The entry is identified by the FIFO ordering observed within RH partial response FIFO queue 940, as indicated by the X, Y or Z pointer 1216-1220 associated with the link on which the partial response was received. Combining logic 1502 then accumulates the partial response of the remote leaf 100 with the contents of the partial response field 1202 of the entry 1230 that was read. As mentioned above, the accumulation operation is preferably a non-destructive operation, such as a logical OR operation. Next, combining logic 1502 determines at block 1614 by reference to the response flag array 1234 of the entry 1230 whether, with the partial response received at block 1604, all of the remote leaves 100 have reported their respective partial responses. If not, the process proceeds to block 1616, which illustrates combining logic 1502 updating the partial response field 1202 of the entry 1230 allocated to the operation with the accumulated partial response, setting the appropriate flag in response flag array 1234 to indicate which remote leaf 100 provided a partial response, and advancing the associated one of pointers 1216-1220.
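
The accumulate-and-check step of FIG. 16B reduces to a small routine; the entry layout repeats the assumptions of the earlier remote hub sketch, and the logical OR stands for the non-destructive accumulation the text prefers.

```c
#include <stdbool.h>
#include <stdint.h>

enum rh_flag { R_FLAG, X_FLAG, Y_FLAG, Z_FLAG, NUM_RH_FLAGS };

struct rh_entry {
    uint32_t partial_response;        /* partial response field 1202 */
    bool     flag[NUM_RH_FLAGS];      /* response flag array 1234    */
    bool     expected[NUM_RH_FLAGS];  /* which links are connected   */
};

/* Returns true when, with this partial response, every expected remote
 * leaf has reported, so the entry may be deallocated (ending tenure
 * 1308) and the accumulated result forwarded on the outbound A link.   */
bool rh_accumulate(struct rh_entry *e, enum rh_flag link, uint32_t presp)
{
    e->partial_response |= presp;     /* non-destructive accumulation  */
    e->flag[link] = true;

    for (int i = 0; i < NUM_RH_FLAGS; i++)
        if (e->expected[i] && !e->flag[i])
            return false;             /* still waiting on a source (block 1616) */
    return true;                      /* blocks 1620-1622 */
}
```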

Referring again to block 1614, in response to a determination bycombining logic 1502 that all remote leaves 100 have reported theirrespective partial responses for the operation, combining logic 1502deallocates the entry 1230 for the operation from RH partial responseFIFO queue 940 by reference to deallocation pointer 1212, ending tenure1308 (block 1620). Combining logic 1502 also routes the accumulatedpartial response to the outbound A link utilizing the remote partialresponse field 712 or 812 in the link allocation information, asdepicted at block 1622. Thereafter, the process passes through pageconnector 1624 to FIG. 16C.

Referring now to FIG. 16C, there is depicted a high level logical flowchart of an exemplary method of partial response processing at a local hub in accordance with an embodiment of the present invention. The process begins at block 1624 in response to receipt at the local hub 100 of a partial response from a remote hub 100 via the inbound A link. In response to receipt of the partial response, combining logic 1506 reads out the entry 1200 within LH partial response FIFO queue 930 allocated to the operation. The entry is identified by the FIFO ordering observed within LH partial response FIFO queue 930, as indicated by the A pointer 1214. Combining logic 1506 then accumulates the partial response of the remote hub 100 with the contents of the partial response field 1202 of the entry 1200 that was read. As shown at block 1634, combining logic 1506 further determines by reference to the source field 1102 of the relevant entry in LH tag FIFO queue 924 a whether the local hub 100 is the local master 100 of the operation. If not, the process proceeds to block 1636; otherwise, the process passes to block 1642.

At block 1636, combining logic 1506 deallocates the entry 1200 for the operation from LH partial response FIFO queue 930 by reference to deallocation pointer 1212, ending tenure 1304. Combining logic 1506 then passes the accumulated partial response and a source indication to route logic 1508, which routes the accumulated partial response to the local master 100 via the local partial response field 708 or 808 in the link information allocation of the particular first tier link indicated by source field 1102, as depicted at block 1638. Thereafter, the process passes through page connector 1640 to FIG. 16D, which illustrates partial response processing at the local master.

Referring now to block 1642, if the local hub 100 is the local master100, combining logic 1506 determines by reference to the response flagarray 1204 of the entry 1200 whether, with the partial response receivedat block 1624, all of the local hubs 100 and the remote hub, if any,have reported their respective partial responses. If not, the processproceeds to block 1644, which illustrates combining logic 1506 updatingthe partial response field 1202 of the entry 1200 allocated to theoperation with the accumulated partial response, setting the appropriateflag in response flag array 1204 to indicate the link on which thepartial response was received, and advancing the A pointer 1214associated with the newly set flag to the next entry. Thereafter, themethod of partial response processing at the local hub ends at block1646.

Referring again to block 1642, if combining logic 1506 determines thatall local hubs and the remote hub, if any, have all reported theirrespective partial responses for the operation, combining logic 1506deallocates the entry 1200 for the operation from LH partial responseFIFO queue 930 by reference to deallocation pointer 1212, ending tenure1304 (block 1650). Combining logic 1506 also routes the accumulatedpartial response to response logic 122, as depicted at block 1652.Thereafter, the process passes through page connector 1654 to FIG. 18A,which is described below.

With reference now to FIG. 16D, there is illustrated a high level logical flowchart of an exemplary method of partial response processing at a local master in accordance with the present invention. The process begins at block 1640 in response to receipt at the local master 100 of a partial response from a local hub 100 via one of the inbound first tier X, Y and Z links. In response to receipt of the partial response, combining logic 1510 reads out the entry 1200 within LH partial response FIFO queue 930 allocated to the operation. The entry is identified by the FIFO ordering observed within LH partial response FIFO queue 930, as indicated by the X, Y or Z pointer 1216-1220 associated with the link on which the partial response was received. Combining logic 1510 then accumulates the partial response of the local hub 100 with the contents of the partial response field 1202 of the entry 1200 that was read. As shown at block 1664, combining logic 1510 further determines by reference to the source field 1102 of the relevant entry in LH tag FIFO queue 924 a whether the local hub 100 is the local master 100. If not, an error occurs that causes processing to halt, as shown at block 1666. If, however, an affirmative determination is made at block 1664, the process proceeds to block 1668.

Block 1668 depicts combining logic 1510 determining by reference toresponse flag array 1204 whether, with the partial response received atblock 1640, all of the local hubs 100 and the remote hub 100, if any,have reported their respective partial responses for the operation. Ifnot, the process proceeds to block 1670, which illustrates combininglogic 1510 updating the partial response field 1202 of the entry 1200allocated to the operation with the accumulated partial response,setting the appropriate flag in response flag array 1204 to indicate thelink on which the partial response was received, and advancing to thenext entry the pointer among pointers 1216-1220 associated with thenewly set flag. Thereafter, the method of partial response processing atthe local master ends at block 1672.

Referring again to block 1668, if combining logic 1510 determines thatall local hubs 100 and the remote hub 100, if any, have all reportedtheir respective partial responses for the operation, combining logic1510 deallocates the entry 1200 for the operation from LH partialresponse FIFO queue 930 by reference to deallocation pointer 1212,ending tenure 1304 at the local master/local hub 100 (block 1680).Combining logic 1510 also routes the accumulated partial response toresponse logic 122, as depicted at block 1682. Thereafter, the processpasses through page connector 1684 to FIG. 18A, which is describedbelow.

Referring now to FIG. 17, there is depicted a block diagram of an exemplary embodiment of the combined response logic 121 c within interconnect logic 120 of FIG. 1 in accordance with the present invention. As shown, combined response logic 121 c includes a first multiplexer 1704 having a first input to which response logic 122 is coupled and a second input coupled to a hold buffer 1702 that receives frames on the inbound A link. The output of first multiplexer 1704 drives a first bus 1705 that is coupled to each of the outbound X, Y and Z links, a remote hub (RH) buffer 1706, and the local hub (LH) combined response launch buffer 1710.

The inbound first tier X, Y and Z links are each coupled to the LHcombined response launch buffer 1710, as well as a respective one ofremote leaf (RL) buffers 1714 a-1714 c. The outputs of remote hub buffer1706, LH combined response launch buffer 1710, and RL buffers 1714a-1714 c all form inputs of a second multiplexer 1720. The output ofsecond multiplexer 1720 drives a second bus 1722 to which tag FIFOqueues 924, the snoopers 304 (e.g., snoopers 116 of L2 cache 110 andsnoopers 126 of IMC 124) of the processing unit 100, and the outbound Alink are coupled. Masters 300 and response logic 122 are further coupledto tag FIFO queues 924. The connection to tag FIFO queue 924 permits thelocal master to observe the combined response and the accumulatedpartial response from which the combined response was produced beforeany snooper at a local hub observes the combined response. Consequently,the timing constraint term regarding the combined response latency fromthe winning master 300 to snooper 304 n (i.e., C_lat(WM_S)) is non-zeroand, more particularly, approximately equivalent to the first tier linktransmission latency. Consequently, the timing constraint set forthabove can be met.

With reference now to FIG. 18A-18D, there are depicted high levellogical flowcharts respectively depicting exemplary combined responsephase processing at a local master, local hub, remote hub, and remoteleaf in accordance with an exemplary embodiment of the presentinvention. Referring now specifically to FIG. 18A, combined responsephase processing at the local master 100 begins at block 1800 and thenproceeds to block 1802, which depicts response logic 122 generating thecombined response for an operation based upon the type of request andthe accumulated partial response. Response logic 122 presents thecombined response to the first input of first multiplexer 1704 and to LHtag FIFO 924 a.

Following block 1802, the process proceeds to blocks 1804 and 1806, eachof which represents a condition on the issuance of the combined responseon first bus 1705 by first multiplexer 1704. Block 1804 illustratesmultiplexer 1704 determining if an address tenure is then available inthe outbound first tier link information allocation. That is, the outputof first multiplexer 1704 is timeslice-aligned with the selected linkinformation allocation and will only generate an output during cyclesdesigned to carry a combined response (e.g., cycle 1 or 3 of theembodiment of FIG. 7A or cycle 1 of the embodiment of FIG. 8A). Asfurther illustrated at block 1806, first multiplexer 1704 will also onlyissue a combined response if no competing combined response is inboundon the second tier A link, which is always given priority. Thus, thesecond tier links are guaranteed to be non-blocking with respect toinbound combined responses.

If a negative determination is made at either of blocks 1804-1806, the combined response is delayed, as indicated at block 1808, until a subsequent cycle during which both of the determinations illustrated at blocks 1804-1806 are positive. If, on the other hand, positive determinations are made at both of blocks 1804-1806, the process bifurcates and proceeds to each of blocks 1810 and 1820. Block 1820 depicts first multiplexer 1704 broadcasting the combined response on first bus 1705 to each of the outbound X, Y and Z links and to LH combined response launch buffer 1710. Thereafter, the process bifurcates again and passes through page connectors 1822 and 1824 to FIG. 18B, which illustrates the processing of the combined response at each of the local hubs 100.

Returning to block 1810, response logic 122 also queries LH tag FIFOqueue 924 a via the local master port as shown in FIG. 17, which routesthe master tag read from LH tag FIFO queue 924 a and the associatedcombined response and accumulated partial response to the particular oneof masters 300 identified by the master tag (block 1812). In response toreceipt of the combined response and master tag, the originating master300 processes the combined response, and if the corresponding requestwas a write-type request, the accumulated partial response (block 1814).

For example, if the combined response indicates “success” and the corresponding request was a read-type request (e.g., a read, DClaim or RWITM request), the originating master 300 may update or prepare to receive a requested memory block. In this case, the accumulated partial response is discarded. If the combined response indicates “success” and the corresponding request was a write-type request (e.g., a castout, write or partial write request), the originating master 300 extracts the destination tag field 724 from the accumulated partial response and utilizes the contents thereof as the data tag 714 or 814 used to route the subsequent data phase of the operation to its destination, as described below with reference to FIGS. 20A-20C. If a “success” combined response indicates or implies a grant of HPC status for the originating master 300, then the originating master 300 will additionally begin to protect its ownership of the memory block, as depicted at reference numerals 312 b and 1314. If, however, the combined response received at block 1814 indicates another outcome, such as “retry”, the originating master 300 may be required to reissue the request. Thereafter, the process ends at block 1816.
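
An illustrative fragment of the master-side handling described above is given below. The bit positions assumed for the valid and destination tag fields and the outcome encoding are inventions for the example; they do not reproduce the actual layout of the partial response.

```c
#include <stdbool.h>
#include <stdint.h>

enum cresp { CRESP_SUCCESS, CRESP_RETRY };

/* Assumed layout: destination tag field 724 in the low 24 bits of the
 * accumulated partial response, valid field 722 in bit 24.             */
#define PRESP_DEST_TAG(presp)  ((uint32_t)((presp) & 0xFFFFFFu))
#define PRESP_VALID(presp)     (((presp) >> 24) & 1u)

/* Returns the data tag to use for the data phase of a successful
 * write-type request; otherwise returns 0 and may flag a retry.        */
uint32_t master_handle_cresp(enum cresp cr, bool write_type,
                             uint32_t accumulated_presp, bool *retry)
{
    *retry = false;
    if (cr != CRESP_SUCCESS) {
        *retry = true;                /* e.g., reissue the request          */
        return 0;
    }
    if (write_type && PRESP_VALID(accumulated_presp))
        return PRESP_DEST_TAG(accumulated_presp);  /* routes the data phase */
    return 0;                         /* read-type: partial response discarded */
}
```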

With reference now to FIG. 18B, an exemplary method of processing acombined response at a local hub 100 that is also the local master 100is illustrated beginning at block 1838, and an exemplary method ofprocessing of a combined response at each of the other local hubs 100 inthe same processing node 202 as the local master 100 is depictedbeginning at block 1830. Turning first to block 1830, combined responsesreceived by a local hub 100 on the inbound X, Y and Z links are receivedby LH combined response launch buffer 1710. As depicted at blocks1832-1834 and in FIG. 17, map logic 1730 within LH combined responselaunch buffer 1710 maps each of the X, Y and Z combined responses to theappropriate one of position-dependent FIFO queues 1740 a-1740 d and thenvalidates the combined responses within position-dependent FIFO queues1740 a-1740 d.

Referring now to block 1838, at the local master/local hub 100, thecombined response on first bus 1705 is fed directly into LH combinedresponse launch buffer 1710. As with the combined responses received onthe X, Y and Z links, map logic 1730 maps the combined response to theappropriate position-dependent FIFO queue 1740 based upon the knownphysical position of the local master/local hub 100. Following themapping by map logic 1730 at block 1840, map logic 1730 validates thecombined response in its position-dependent FIFO queue 1740 as shown atblock 1834.

Following the validation at block 1834, the process then proceeds to block 1836, which illustrates arbiter 1752 selecting from position-dependent FIFO queues 1740 a-1740 d a combined response for presentation to second multiplexer 1720. As indicated, the combined response selected by arbiter 1752 is not necessarily the oldest combined response out of all the combined responses residing in position-dependent FIFO queues 1740 a-1740 d, but is instead the combined response that corresponds to the oldest master tag residing within LH tag FIFO queue 924 a. Thus, the combined responses are broadcast in the same relative order as the requests.
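
A simplified sketch of the ordering rule at block 1836: the head of LH tag FIFO queue 924 a dictates which position-dependent queue is serviced, so combined responses leave in request order. The array representation and return convention are assumptions.

```c
#include <stdbool.h>

#define NUM_POSITIONS 4    /* position-dependent FIFO queues 1740a-1740d */

struct cresp_queue {
    bool validated_head;   /* a validated combined response is waiting */
};

/* oldest_tag_position: the processing unit position recorded with the
 * oldest entry in LH tag FIFO queue 924a.  Returns the queue to service
 * next, or -1 if its combined response has not arrived yet.             */
int arbiter_1752_select(const struct cresp_queue q[NUM_POSITIONS],
                        int oldest_tag_position)
{
    if (q[oldest_tag_position].validated_head)
        return oldest_tag_position;   /* preserve request order              */
    return -1;                        /* never reorder around the oldest tag */
}
```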

As depicted at block 1844, arbiter 1752 is time-slice aligned with the selected second tier link information allocation and permits multiplexer 1750 to output a combined response only if an address tenure is then available for the combined response in the outbound second tier link information allocation. Thus, for example, arbiter 1752 causes multiplexer 1750 to output combined responses only during cycle 1 or 3 of the embodiment of FIG. 7B or cycle 1 of the embodiment of FIG. 8B. If a negative determination is made at block 1844, the combined response is delayed, as indicated at block 1846, until a subsequent cycle during which an address tenure is available. If, on the other hand, a positive determination is made at block 1844, arbiter 1752 signals multiplexer 1750 to output the selected combined response to an input of second multiplexer 1720, which always gives priority to a combined response, if any, presented by LH combined response launch buffer 1710. Thus, second multiplexer 1720 issues the combined response received from LH combined response launch buffer 1710 on second bus 1722. It should be noted that arbiter 1752 cannot overdrive the output bandwidth since the inbound arrival rate of combined responses is limited by the prior grant of the corresponding requests.

It should also be noted that the other ports of second multiplexer 1720(e.g., RH, RLX, RLY, and RLZ) could present requests concurrently withLH combined response launch buffer 1710, meaning that the maximumbandwidth of second bus 1722 must equal 10/8 (assuming the embodiment ofFIG. 7B) or 5/6 (assuming the embodiment of FIG. 8B) of the bandwidth ofthe outbound A link in order to keep up with maximum arrival rate.

It should further be observed that only combined responses bufferedwithin local hub combined response launch buffer 1710 are transmitted onthe outbound A link and are required to be aligned with address tenureswithin the link information allocation. Because all other combinedresponses competing for issuance by second multiplexer 1720 target onlythe local snoopers 304 and their respective FIFO queues rather than theoutbound A link, such combined responses may be issued in the remainingcycles of the information frames. Consequently, regardless of theparticular arbitration scheme employed by second multiplexer 1720, allcombined responses concurrently presented to second multiplexer 1720 areguaranteed to be transmitted within the latency of a single informationframe.

Following the issuance of the combined response on second bus 1722, theprocess bifurcates and proceeds to each of blocks 1848 and 1852. Block1848 depicts routing the combined response to the outbound A link fortransmission to the remote hub 100. Thereafter, the process proceedsthrough page connector 1850 to FIG. 18C, which depicts an exemplarymethod of combined response processing at a remote hub 100.

Referring now to block 1852, the combined response issued on second bus1722 is also utilized to query LH tag FIFO queue 924 a to obtain themaster tag from the oldest entry therein. Thereafter, LH tag FIFO queue924 a deallocates the entry allocated to the operation, ending tenure1302 (block 1854). The combined response and the associated master tagare then routed to the snoopers 304 within the local hub 100, as shownat block 1856. In response to receipt of the combined response, snoopers304 process the combined response and perform any operation required inresponse thereto (block 1858). For example, a snooper 304 may source arequested memory block to the master 300 of the request, invalidate acached copy of the requested memory block, etc. If the combined responseincludes an indication that the snooper 304 is to transfer ownership ofthe memory block to the requesting master 300, snooper 304 ends itsprotection window 312 a. Thereafter, combined response phase processingat the local hub 100 ends at block 1859.

Referring now to FIG. 18C, there is depicted a high level logicalflowchart of an exemplary method of combined response phase processingat a remote hub 100 in accordance with the present invention. Asdepicted, the process begins at page connector 1860 upon receipt of acombined response at a remote hub 100 on its inbound A link. As notedabove, the inbound A link is non-blocking, meaning that after the framecontaining a combined response is latched into a hold buffer 1702 asshown at block 1862, the combined response is transmitted by firstmultiplexer 1704 as soon as the next address tenure is available in thefirst tier link information allocation. Thus, as shown at block 1864, ifno address tenure is available, first multiplexer 1704 waits for thenext address tenure as shown at block 1866. If an address tenure isavailable, the process proceeds from block 1864 to block 1868, whichillustrates first multiplexer 1704 broadcasting the combined response onfirst bus 1705 to the outbound X, Y and Z links and RH hold buffer 1706.

Following block 1868, the process bifurcates. A first path passesthrough page connector 1870 to FIG. 18D, which illustrates an exemplarymethod of combined response phase processing at the remote leaves 100.The second path from block 1868 proceeds to block 1874, whichillustrates the second multiplexer 1720 determining which of thecombined responses presented at its inputs to output on second bus 1722.As indicated, second multiplexer 1720 prioritizes local hub combinedresponses over remote hub combined responses, which are in turnprioritized over combined responses buffered in remote leaf buffers 1714a-1714 c. Thus, if a local hub combined response is presented forselection by LH combined response launch buffer 1710, the combinedresponse buffered within remote hub buffer 1706 is delayed, as shown atblock 1876. If, however, no combined response is presented by LHcombined response launch buffer 1710, second multiplexer 1720 issues thecombined response from remote hub buffer 1706 on second bus 1722.

In response to detecting the combined response on second bus 1722, RH tag FIFO 924 b reads out the master tag specified by the relevant request from the master tag field 1100 of the relevant entry, as depicted at block 1878, and then deallocates the entry, ending tenure 1306 (block 1880). The combined response and the master tag are further routed to the snoopers 304 in the remote hub 100, as shown at block 1882. In response to receipt of the combined response, the snoopers 304 process the combined response (block 1884) and perform any required operations, as discussed above. If the combined response includes an indication that the snooper 304 is to transfer ownership of the memory block to the requesting master 300, the snooper 304 ends its protection window 312 a. Thereafter, combined response phase processing at the remote hub ends at block 1886.

With reference now to FIG. 18D, there is illustrated a high level logical flowchart of an exemplary method of combined response phase processing at a remote leaf 100 in accordance with the present invention. As shown, the process begins at page connector 1888 upon receipt of a combined response at the remote leaf 100 on one of its inbound X, Y and Z links. As indicated at block 1890, the combined response is latched into one of RL hold buffers 1714 a-1714 c. Next, as depicted at block 1891, the combined response is evaluated by second multiplexer 1720 together with the other combined responses presented to its inputs. As discussed above, second multiplexer 1720 prioritizes local hub combined responses over remote hub combined responses, which are in turn prioritized over combined responses buffered in remote leaf buffers 1714 a-1714 c. Thus, if a local hub or remote hub combined response is presented for selection, the combined response buffered within the RL hold buffer 1714 is delayed, as shown at block 1892. If, however, no higher priority combined response is presented to second multiplexer 1720, second multiplexer 1720 issues the combined response from the RL hold buffer 1714 on second bus 1722.

In response to detecting the combined response on second bus 1722, theRL tag FIFO 924 c-924 e associated with the inbound first tier link onwhich the combined response was received reads out from the master tagfield 1100 of one of its entries the master tag specified by theassociated request, as depicted at block 1893, and then deallocates theentry, ending tenure 1310 (block 1894). The combined response and themaster tag are further routed to the snoopers 304 in the remote leaf100, as shown at block 1895. In response to receipt of the combinedresponse, the snoopers 304 process the combined response (block 1896)and perform any required operations, as discussed above. If the combinedresponse includes an indication that the snooper 304 is to transferownership of the memory block to the requesting master 300, snooper 304ends its protection window 312 a (and protection window 1312 of FIG.13). Thereafter, combined response phase processing at the remote leaf100 ends at block 1897.

Referring now to FIG. 19, there is depicted a block diagram of anexemplary embodiment of data logic 121 d within interconnect logic 120.As shown, data logic 121 d includes a second tier link FIFO queue 1910for buffering in arrival order data tags and tenures received on thein-bound A link, as well as an outbound XYZ switch 1906, coupled to theoutput of second tier link FIFO queue 1910, for routing data tags andtenures to outbound first tier X, Y and Z links. In addition, data logic121 d includes first tier link FIFO queues 1912 a-1912 c, which are eachcoupled to a respective one of the inbound X, Y and Z links to queue inarrival order inbound data tags and tenures, and an outbound A switch1908, coupled to the outputs of first tier link FIFO queues 1912 a-1912c, for routing data tags and tenures to outbound A links. Data logic 121d further includes an m:n data multiplexer 1904, which outputs data fromone or more selected data sources 1900 (e.g., data sources within L2cache array 114, IMC 124 and I/O controller 128) to outbound XYZ switch1906, data sinks 1902 (e.g., data sinks within L2 cache array 114, IMC124 and I/O controller 128), and/or outbound A switch 1908 under thecontrol of arbiter 1905. Data sinks 1902 are further coupled to receivedata from the inbound X, Y, Z and A links. The operation of data logic121 d is described below with reference to FIGS. 20A-20C, whichrespectively depict data phase processing at the processing unitcontaining the data source, at a processing unit receiving data fromanother processing unit in its processing node, and at a processing unitreceiving data from a processing unit in another processing node.

Referring now to FIG. 20A, there is depicted a high level logicalflowchart of an exemplary method of data phase processing at a sourceprocessing unit 100 containing the data source 1900 that initiatestransmission of data. The source processing unit 100 may be the localmaster, local hub, remote hub or remote leaf with respect to the requestwith which the data transfer is associated. In the depicted method,decision blocks 2002, 2004, 2010, 2020, 2022, 2030, 2032, 2033, 2034,2040 and 2042 all represent determinations made by arbiter 1905 inselecting the data source(s) 1900 that will be permitted to transmitdata via multiplexer 1904 to outbound XYZ switch 1906, data sinks 1902,and/or outbound A switch 1908.

As shown, the process begins at block 2000 and then proceeds to block2002, which illustrates arbiter 1905 determining whether or not a datatenure is currently available. For example, in the embodiment of FIGS.7A-7B, the data tenure includes the data tag 714 of cycle 2 and the datapayload of cycles 4-7. Alternatively, in the embodiment of FIGS. 8A-8B,the data tenure includes the data tag 814 of cycle 0 and the datapayload of cycles 2-5. If a data tenure is not currently available, datatransmission must wait, as depicted at block 2006. Thereafter, theprocess returns to block 2002.

Referring now to block 2004, assuming the presence of multiple datasources 1900 all contending for the opportunity to transmit data,arbiter 1905 further selects one or more “winning” data sources 1900that are candidates to transmit data from among the contending datasources 1900. In a preferred embodiment, the “winning” data source(s)1900 are permitted to output data on up to all of the X, Y, Z and Alinks during each given link information allocation frame. A data source1900 that is contending for an opportunity to transmit data and is notselected by arbiter 1905 in the current link information allocationframe must delay transmission of its data until a subsequent frame, asindicated by the process returning to blocks 2006 and 2002.

Referring now to blocks 2010, 2020, and 2030, arbiter 1905 examines the data tag presented by a “winning” data source 1900 to identify a destination for its data. For example, the data tag may indicate a destination processing node 202, a destination processing unit 100 (e.g., by S, T, U, V position), a logic unit within the destination processing unit 100 (e.g., L2 cache masters 112), and a particular state machine (e.g., a specific data sink 1902) within the logic unit. By examining the data tag in light of the known topology of data processing system 200, arbiter 1905 can determine whether or not the source processing unit 100 is the destination processing unit 100 (block 2010), within the same processing node 202 as the destination processing unit 100 (block 2020), or directly coupled to the destination processing node 202 by a second tier link (block 2030). Based upon this examination, arbiter 1905 can determine whether or not the resource(s) required to transmit a “winning” data source's data are available.
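
A hedged sketch of this destination examination follows. The tag bit fields, the enumeration of routing cases, and the index-matching rule used to detect a direct A-link connection are assumptions modeled on the topology description rather than the actual tag format.

```c
#include <stdint.h>

/* Assumed data tag layout: destination node, unit, logic unit, machine. */
struct data_tag { uint8_t node, unit, logic_unit, machine; };

enum route { ROUTE_LOCAL_SINK, ROUTE_FIRST_TIER, ROUTE_A_LINK, ROUTE_VIA_HUB };

enum route classify_destination(struct data_tag tag,
                                uint8_t my_node, uint8_t my_unit)
{
    if (tag.node == my_node && tag.unit == my_unit)
        return ROUTE_LOCAL_SINK;      /* block 2010: deliver to local data sinks */
    if (tag.node == my_node)
        return ROUTE_FIRST_TIER;      /* block 2020: outbound X, Y or Z link     */
    /* Topology rule: a unit's A link reaches the remote node whose index
     * matches the unit's own index.                                       */
    if (tag.node == my_unit)
        return ROUTE_A_LINK;          /* block 2030: directly connected          */
    return ROUTE_VIA_HUB;             /* forward through an intermediate hub     */
}
```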

For example, if the source processing unit 100 is the destinationprocessing unit 100 (block 2010), there is no resource constraint on thedata transmission, and arbiter 1905 directs multiplexer 1904 to routethe data and associated data tag to the local data sinks 1902 forprocessing by the indicated data sink 1902 (block 2012). Thereafter, theprocess ends at block 2014.

If, however, the source processing unit 100 is not the destination processing unit 100 but the source processing node 202 is the destination processing node 202 (block 2020), arbiter 1905 determines at block 2022 whether outbound XYZ switch 1906 is available to handle a selected data transmission. If not, the process passes to block 2006, which has been described. If, however, outbound XYZ switch 1906 is available to handle a selected data transmission, arbiter 1905 directs multiplexer 1904 to route the data to outbound XYZ switch 1906 for transmission to the destination processing unit 100 identified by the data tag, which by virtue of the determination at block 2020 is directly connected to the present processing unit 100 by a first tier link (block 2024). Thereafter, the process proceeds through page connector 2026 to FIG. 20B, which illustrates processing of the data at the destination processing unit 100. It should also be noted by reference to blocks 2020, 2022 and 2024 that data transmission by a source processing unit 100 to any destination processing unit 100 in the same processing node 202 is non-blocking and not subject to any queuing or other limitation by the destination processing unit 100.

Referring now to block 2030, arbiter 1905 determines whether or not thesource processing unit 100 is directly connected to a destinationprocessing node 202 by a second tier (A) link. Assuming the topologyconstruction rule set forth previously, this determination can be madeby determining whether the index assigned to the source processing unit100 matches the index assigned to the destination processing node 202.If the source processing unit 100 is not directly connected to adestination processing node 202 by a second tier (A) link, the sourceprocessing unit 100 must transmit the data tenure to the destinationprocessing node 202 via an intermediate hub 100 in the same processingnode 202 as the source processing unit 100. This data transmission issubject to two additional constraints depicted at blocks 2040 and 2042.

First, as illustrated at block 2040, outbound XYZ switch 1906 must beavailable to handle the data transmission. Second, as depicted at block2042, the intermediate hub 100 must have an entry available in therelevant one of its FIFO queues 1912 a-1912 c to receive the datatransmission. As noted briefly above, in a preferred embodiment, sourceprocessing unit 100 tracks the availability of queue entries at theintermediate hub 100 based upon data tokens 715 or 815 transmitted fromthe intermediate hub 100 to the source processing unit 100. If either ofthe criteria depicted at blocks 2040 and 2042 is not met, the processpasses to block 2006, which is described above. If, however, bothcriteria are met, arbiter 1905 directs multiplexer 1904 to route thedata tenure to outbound XYZ switch 1906 for transmission to theintermediate hub 100 (block 2044). Thereafter, the process proceedsthrough page connector 2046 to FIG. 20B.

Returning to block 2030, if arbiter 1905 determines that the source processing unit 100 is directly connected to a destination processing node 202 by a second tier (A) link, the data transmission is again conditioned on the availability of resources at one or both of the source processing unit 100 and the receiving processing unit 100. In particular, as shown at block 2032, outbound A switch 1908 must be available to handle the data transmission. In addition, as indicated at blocks 2033 and 2034, the data transmission may be dependent upon whether a queue entry is available for the data transmission in the FIFO queue 1910 of the downstream processing unit 100. That is, if the source processing unit 100 is directly connected to the destination processing unit 100 (e.g., as indicated by the index of the destination processing unit 100 matching the index of the source processing node 202), data transmission by the source processing unit 100 to the destination processing unit 100 is non-blocking and not subject to any queuing or other limitation by the destination processing unit 100. If, however, the source processing unit 100 is connected to the destination processing unit 100 via an intermediate hub 100 (e.g., as indicated by the index of the destination processing unit 100 differing from the index of the source processing node 202), the intermediate hub 100 must have an entry available in its FIFO queue 1910 to receive the data transmission. The availability of a queue entry in FIFO queue 1910 is indicated to the source processing unit 100 by data tokens 715 or 815 received from the intermediate hub 100.

Assuming the condition depicted at block 2032 is met and, if necessary(as determined by block 2033), the condition illustrated at block 2034is met, the process passes to block 2036. Block 2036 depicts arbiter1905 directing multiplexer 1904 to route the data tag and data tenure tooutbound A switch 1908 for transmission to the intermediate hub 100.Thereafter, the process proceeds through page connector 2038 to FIG.20C, which is described below. In response to a negative determinationat either of blocks 2032 and 2034, the process passes to block 2006,which has been described.

With reference now to FIG. 20B, there is illustrated a high levellogical flowchart of an exemplary method of data phase processing at aprocessing unit 100 receiving data from another processing unit 100 inthe same processing node 202. As depicted, the process begins at block2050 in response to receipt of a data tag on one of the inbound firsttier X, Y and Z links. In response to receipt of the data tag,unillustrated steering logic within data logic 121 d examines the datatag at block 2052 to determine if the processing unit 100 is thedestination processing unit 100. If so, the steering logic routes thedata tag and the data tenure following the data tag to the local datasinks 1902, as shown at block 2054. Thereafter, data phase processingends at block 2056.

If, however, the steering logic determines at block 2052 that the present processing unit 100 is not the destination processing unit 100, the process passes to block 2060. Block 2060 depicts buffering the data tag and data tenure within the relevant one of FIFO queues 1912 a-1912 c until the data tag and data tenure can be forwarded via the outbound A link. As illustrated at blocks 2062, 2064 and 2066, the data tag and data tenure can be forwarded only when outbound A switch 1908 is available to handle the data transmission (block 2062) and the downstream processing unit 100 has an entry available in its FIFO queue 1910 to receive the data transmission (as indicated by data tokens 715 or 815 received from the downstream processing unit 100). When the conditions illustrated at blocks 2062 and 2066 are met concurrently, the entry in FIFO queue 1912 allocated to the data tag and data tenure is freed (block 2068), and a data token 715 or 815 is transmitted to the upstream processing unit 100 to indicate that the entry in FIFO queue 1912 is available for reuse. In addition, outbound A switch 1908 routes the data tag and data tenure to the outbound A link (block 2070). Thereafter, the process proceeds through page connector 2072 to FIG. 20C.
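
The forwarding conditions of blocks 2062-2070 amount to a credit check plus a switch-availability check. In the sketch below the credit counter stands in for data tokens 715 or 815, the callback stands in for the upstream token return, and the structure names are invented.

```c
#include <stdbool.h>

struct fwd_state {
    unsigned downstream_credits;  /* data tokens 715/815 from the downstream FIFO 1910 */
    unsigned local_entries_used;  /* occupancy of this unit's FIFO queue 1912          */
};

/* Attempts to forward the buffered data tag and tenure on the outbound A
 * link; assumes the caller only invokes it while an entry is buffered.   */
bool try_forward_on_a_link(struct fwd_state *s, bool a_switch_available,
                           void (*send_token_upstream)(void))
{
    if (!a_switch_available || s->downstream_credits == 0)
        return false;             /* conditions of blocks 2062/2066 not yet met  */

    s->downstream_credits--;      /* consume a downstream data token             */
    s->local_entries_used--;      /* free the FIFO queue 1912 entry (block 2068) */
    send_token_upstream();        /* return a data token to the upstream unit    */
    /* ...route the data tag and tenure to the outbound A link (block 2070)...   */
    return true;
}
```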

Referring now to FIG. 20C, there is depicted a high level logical flowchart of an exemplary method of data phase processing at a processing unit 100 receiving data from a processing unit 100 in another processing node 202. As depicted, the process begins at block 2080 in response to receipt of a data tag on the inbound second tier A link. In response to receipt of the data tag, unillustrated steering logic within data logic 121d examines the data tag at block 2082 to determine if the present processing unit 100 is the destination processing unit 100. If so, the steering logic routes the data tag and the data tenure following the data tag to the local data sinks 1902, as shown at block 2084. Thereafter, data phase processing ends at block 2086.

If, however, the steering logic determines at block 2082 that the present processing unit 100 is not the destination processing unit 100, the process passes to block 2090. Block 2090 depicts buffering the data tag and data tenure within FIFO queue 1910 until the data tag and data tenure can be forwarded via the appropriate one of the outbound X, Y and Z links. As illustrated at blocks 2092 and 2094, the data tag and data tenure can be forwarded only when outbound XYZ switch 1906 is available to handle the data transmission (block 2092). When the condition illustrated at block 2092 is met, the entry in FIFO queue 1910 allocated to the data tag and data tenure is freed (block 2097), and a data token 715 or 815 is transmitted to the upstream processing unit 100 via the A link to indicate that the entry in FIFO queue 1910 is available for reuse. In addition, outbound XYZ switch 1906 routes the data tag and data tenure to the relevant one of the outbound X, Y and Z links (block 2098). Thereafter, the process proceeds through page connector 2099 to FIG. 20B, which has been described.
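The second-tier receive path of blocks 2082 through 2098 may be illustrated, under the same caveats as the earlier sketches, as follows in C; the selection of a particular X, Y or Z link from the data tag is an assumption about how the steering logic might operate.

#include <stdbool.h>

enum first_tier_link { LINK_X, LINK_Y, LINK_Z };

/* Assumed helpers standing in for hardware actions. */
extern void deliver_to_local_sinks(void);                     /* block 2084 */
extern void free_fifo_1910_entry(void);                       /* block 2097 */
extern void send_data_token_on_a_link(void);                  /* token 715 or 815 upstream */
extern void route_to_first_tier_link(enum first_tier_link);   /* block 2098 */

static void on_second_tier_data(unsigned dest_unit, unsigned my_unit,
                                bool xyz_switch_free, enum first_tier_link out)
{
    if (dest_unit == my_unit) {          /* block 2082 */
        deliver_to_local_sinks();        /* block 2084 */
        return;
    }
    if (!xyz_switch_free)                /* block 2092: remain buffered in FIFO queue 1910 */
        return;

    free_fifo_1910_entry();              /* block 2097 */
    send_data_token_on_a_link();         /* upstream entry now available for reuse */
    route_to_first_tier_link(out);       /* block 2098: relevant X, Y or Z link    */
}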

As has been described, the present invention provides an improved processing unit, data processing system and interconnect fabric for a data processing system. The inventive data processing system topology disclosed herein increases in interconnect bandwidth as the system scales. In addition, a data processing system employing the topology disclosed herein may be hot upgraded (i.e., processing nodes may be added during operation), downgraded (i.e., processing nodes may be removed during operation), or repaired without disrupting communication between the processing units of the resulting data processing system, through the connection, disconnection or repair of individual processing nodes. The data processing system topology described herein also permits the latency of operation phases on the interconnect fabric to remain bounded regardless of system scale. For example, the request latency beginning with block 1444 and ending at each of blocks 1454, 1482, and 1495 for the local hubs, remote hubs and remote leaves, respectively, is bounded, as shown in FIG. 13. In addition, the combined response latency beginning at block 1802 and ending at each of blocks 1814, 1858, 1884 and 1896 for the local master, local hubs, remote hubs and remote leaves, respectively, is similarly bounded. Consequently, increases in system scale do not modify latencies in a manner that violates the time constraint required for correctness.

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention. For example, although the present invention discloses preferred embodiments in which FIFO queues are utilized to order operation-related tags and partial responses, those skilled in the art will appreciate that other ordered data structures may be employed to maintain an order between the various tags and partial responses of operations in the manner described. In addition, although preferred embodiments of the present invention employ uni-directional communication links, those skilled in the art will understand by reference to the foregoing that bi-directional communication links could alternatively be employed.

What is claimed is:

1. A data processing system, comprising: a plurality of processing units coupled for communication by a communication link; a configuration register having a plurality of different settings each corresponding to a respective one of a plurality of different link information allocations, wherein information is communicated over said communication link in accordance with a particular link information allocation among said plurality of link information allocations determined by a respective setting of said configuration register.

2. The data processing system of claim 1, wherein said selected link information allocation allocates bandwidth of said communication link in a repeating sequence of fixed-length, multi-cycle frames.

3. The data processing system of claim 2, wherein a frame of a first link information allocation among said plurality of link information allocations includes multiple address tenures per data tenure.

4. The data processing system of claim 2, wherein a frame of a first link information allocation among said plurality of link information allocations includes payload data and a data tag identifying a destination of said payload data, wherein said data tag precedes said payload data by more than one cycle.

5. The data processing system of claim 2, wherein a frame of a first link information allocation includes a transaction type field for indicating a type of a request, a master tag field for identifying a master of said request, an address field for identifying a request address for said request, and a response field for providing a response to a previous request.

6. The data processing system of claim 1, and further comprising: means for automatically setting said configuration register in response to detecting a type of workload.
7. A processing unit, comprising: a processor core; interconnect logic coupled to a communication link coupling the processing unit to another processing unit for communication; and a configuration register having a plurality of different settings each corresponding to a respective one of a plurality of different link information allocations, wherein information is communicated over said communication link in accordance with a particular link information allocation among said plurality of link information allocations determined by a respective setting of said configuration register.

8. The processing unit of claim 7, wherein said selected link information allocation allocates bandwidth of said communication link in a repeating sequence of fixed-length, multi-cycle frames.

9. The processing unit of claim 8, wherein a frame of a first link information allocation among said plurality of link information allocations includes multiple address tenures per data tenure.

10. The processing unit of claim 8, wherein a frame of a first link information allocation among said plurality of link information allocations includes payload data and a data tag identifying a destination of said payload data, wherein said data tag precedes said payload data by more than one cycle.

11. The processing unit of claim 8, wherein a frame of a first link information allocation includes a transaction type field for indicating a type of a request, a master tag field for identifying a master of said request, an address field for identifying a request address for said request, and a response field for providing a response to a previous request.

12. The processing unit of claim 7, and further comprising: means for automatically setting said configuration register in response to detecting a type of workload.
13. A method of data processing, comprising: in a configuration register having a plurality of different settings each corresponding to a respective one of a plurality of different link information allocations for a communication link coupling processing units of a data processing system, setting said configuration register to a particular setting among said plurality of settings; and communicating information over said communication link in accordance with a selected link information allocation among said plurality of link information allocations determined by said particular setting of said configuration register.

14. The method of claim 13, wherein said selected link information allocation allocates bandwidth of said communication link in a repeating sequence of fixed-length, multi-cycle frames.

15. The method of claim 14, wherein a frame of a first link information allocation among said plurality of link information allocations includes multiple address tenures per data tenure.

16. The method of claim 14, wherein a frame of a first link information allocation among said plurality of link information allocations includes payload data and a data tag identifying a destination of said payload data, wherein said data tag precedes said payload data by more than one cycle.

17. The method of claim 14, wherein a frame of a first link information allocation includes a transaction type field for indicating a type of a request, a master tag field for identifying a master of said request, an address field for identifying a request address for said request, and a response field for providing a response to a previous request.

18. The method of claim 13, and further comprising: said data processing system automatically setting said configuration register in response to detecting a type of workload.
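The claims above describe the configuration register and frame contents in functional terms. Purely as an informal illustration, and not as part of the claims or of the disclosed embodiments, one way such a register and frame might be modeled in C is sketched below; every name, field width and setting shown is an assumption introduced only for this sketch.

enum link_alloc_setting {
    ALLOC_ADDRESS_HEAVY,   /* e.g., multiple address tenures per data tenure (cf. claim 3) */
    ALLOC_DATA_HEAVY       /* e.g., an allocation favoring data tenures                    */
};

struct config_register {
    enum link_alloc_setting setting;   /* selects one of the link information allocations */
};

/* Fields that one fixed-length, multi-cycle frame might carry (cf. claims 4-5). */
struct frame {
    unsigned      ttype;        /* transaction type of a request                */
    unsigned      master_tag;   /* identifies the master of the request         */
    unsigned long addr;         /* request address                              */
    unsigned      response;     /* response to a previous request               */
    unsigned      data_tag;     /* destination of the payload data; transmitted */
                                /* more than one cycle before the payload       */
    unsigned char payload[32];  /* payload data                                 */
};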